Exercises: Gradients, Hessians, and Convexity

Chapters 01 & 02: Mathematical Foundations & Unconstrained Optimization

Self-assessment guide: Work each problem on paper before expanding the answer. If you can solve 80% without peeking, move on to 02-least-squares-ceres.md.

Section A — Conceptual Questions

A1. Given $f(x, y) = x^2 y + \sin(xy)$, compute the gradient $\nabla f$ and the Hessian $H_f$. At what point(s) is the gradient zero?

Answer

**Gradient:** $$\nabla f = \begin{bmatrix} 2xy + y\cos(xy) \\ x^2 + x\cos(xy) \end{bmatrix}$$ **Hessian:** $$H_f = \begin{bmatrix} 2y - y^2\sin(xy) & 2x + \cos(xy) - xy\sin(xy) \\ 2x + \cos(xy) - xy\sin(xy) & -x^2\sin(xy) \end{bmatrix}$$ At the origin $(0, 0)$: - $\nabla f = (0, 0)$ — it's a stationary point - $H_f = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ with eigenvalues $\pm 1$ — indefinite → saddle point

A2. Prove that $f(x) = \|Ax - b\|^2$ is convex for any matrix $A$ and vector $b$. Under what condition is it strictly convex?

Answer

Expand: $f(x) = x^T A^T A x - 2b^T A x + b^T b$ The Hessian is $H = 2A^T A$. A function is convex iff its Hessian is positive semidefinite everywhere. $A^T A$ is always PSD: for any $v$, $v^T A^T A v = \|Av\|^2 \geq 0$. Therefore $f$ is convex. **Strictly convex** iff $A^T A \succ 0$ (positive definite), which requires $A$ to have full column rank (i.e., $\text{rank}(A) = n$ where $A \in \mathbb{R}^{m \times n}$). This ensures $\|Av\|^2 > 0$ for all $v \neq 0$. If $A$ doesn't have full column rank, there exists $v \neq 0$ with $Av = 0$, so the function is flat in that direction — convex but not strictly convex, and the minimum is not unique.

A3. A colleague says: "Gradient descent always converges to the global minimum if you set the learning rate small enough." Is this correct? Provide a counterexample.

Answer

**Incorrect.** Gradient descent converges to a **local** minimum (or saddle point), not necessarily the global minimum. **Counterexample:** $f(x) = x^4 - 2x^2 + 0.5$ This has two local minima at $x = \pm 1$ and a local maximum at $x = 0$. Starting from $x_0 = -0.5$, gradient descent converges to $x = -1$ regardless of learning rate, missing the global minimum at $x = 1$ (both have the same value here, but in general the left/right minima could differ). For non-convex functions, GD is only guaranteed to find a local minimum (and with small enough step size, it converges to a stationary point, which could even be a saddle point if the direction of negative curvature isn't explored). The statement is true **only for convex functions**, where every local minimum is global.

A4. Why does Newton's method converge in exactly 1 step for a quadratic function? What does this tell you about its behavior near any minimum?

Answer

For a quadratic $f(x) = \frac{1}{2}x^T H x + g^T x + c$: - Gradient: $\nabla f(x) = Hx + g$ - Hessian: $\nabla^2 f(x) = H$ (constant) Newton step: $x_{k+1} = x_k - H^{-1}(Hx_k + g) = -H^{-1}g = x^*$ The Newton step takes you **directly to the minimum** because the second-order Taylor approximation is **exact** for a quadratic. **Near any minimum:** Any smooth function is approximately quadratic near its minimum (Taylor expansion). So Newton's method converges **quadratically** near the minimum — each step roughly doubles the number of correct digits. This is why it converges so fast in the final iterations.

Section B — Computation Problems

B1. Implement gradient descent and Newton's method for the Rosenbrock function:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

Compare: - Number of iterations to converge ($\|\nabla f\| < 10^{-6}$) - Path taken (plot the trajectory) - Effect of step size on GD convergence

import numpy as np

def rosenbrock(x):
    """f(x,y) = (1-x)^2 + 100*(y-x^2)^2"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    # TODO: compute and return [df/dx, df/dy]
    pass

def rosenbrock_hess(x):
    """Hessian of Rosenbrock function"""
    # TODO: compute and return 2x2 Hessian matrix
    pass

def gradient_descent(f, grad, x0, lr=0.001, tol=1e-6, max_iter=100000):
    """
    Returns: (x_final, trajectory, num_iterations)
    """
    # TODO: implement
    pass

def newton_method(f, grad, hess, x0, tol=1e-6, max_iter=100):
    """
    Returns: (x_final, trajectory, num_iterations)
    """
    # TODO: implement with line search for safety
    pass

# Start from (-1, 1)
x0 = np.array([-1.0, 1.0])
# Compare both methods. Expected: GD takes ~10,000+ iterations, Newton ~20.

Solution

import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2*(1 - x[0]) + 200*(x[1] - x[0]**2)*(-2*x[0])
    dy = 200*(x[1] - x[0]**2)
    return np.array([dx, dy])

def rosenbrock_hess(x):
    dxx = 2 - 400*x[1] + 1200*x[0]**2
    dxy = -400*x[0]
    dyy = 200.0
    return np.array([[dxx, dxy], [dxy, dyy]])

def gradient_descent(f, grad, x0, lr=0.001, tol=1e-6, max_iter=100000):
    x = x0.copy()
    trajectory = [x.copy()]
    for i in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, trajectory, i
        x = x - lr * g
        trajectory.append(x.copy())
    return x, trajectory, max_iter

def newton_method(f, grad, hess, x0, tol=1e-6, max_iter=100):
    x = x0.copy()
    trajectory = [x.copy()]
    for i in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, trajectory, i
        H = hess(x)
        # Add damping if H is not positive definite
        try:
            dx = np.linalg.solve(H, -g)
        except np.linalg.LinAlgError:
            dx = -g  # fallback to gradient step

        # Backtracking line search
        alpha = 1.0
        while f(x + alpha * dx) > f(x) + 1e-4 * alpha * g.dot(dx):
            alpha *= 0.5
            if alpha < 1e-16:
                break

        x = x + alpha * dx
        trajectory.append(x.copy())
    return x, trajectory, max_iter

x0 = np.array([-1.0, 1.0])

x_gd, traj_gd, n_gd = gradient_descent(rosenbrock, rosenbrock_grad, x0, lr=0.001)
print(f"GD: {n_gd} iterations, x = {x_gd}")

x_nt, traj_nt, n_nt = newton_method(rosenbrock, rosenbrock_grad, rosenbrock_hess, x0)
print(f"Newton: {n_nt} iterations, x = {x_nt}")

Typical results: GD ~10,000+ iterations (with lr=0.001), Newton ~15–25 iterations. The GD trajectory zig-zags along the narrow valley; Newton cuts straight through.

B2. For the function $f(x) = x_1^2 + 10 x_2^2$ (an elongated bowl):

What is the condition number of the Hessian?
How many iterations does gradient descent take vs. Newton?
What happens to GD if you change it to $f(x) = x_1^2 + 1000 x_2^2$?

Answer

1. $H = \begin{bmatrix} 2 & 0 \\ 0 & 20 \end{bmatrix}$. $\kappa(H) = 20/2 = 10$. 2. GD convergence rate: $((\kappa - 1)/(\kappa + 1))^2 = (9/11)^2 \approx 0.67$ per iteration. Need about $\frac{\log(1/\epsilon)}{-\log(0.67)} \approx 35$ iterations for $\epsilon = 10^{-6}$. Newton: 1 iteration (quadratic function). 3. For $f(x) = x_1^2 + 1000 x_2^2$: $\kappa = 1000$. Rate $= (999/1001)^2 \approx 0.996$. Need $\approx 3,450$ iterations. The worse the condition number, the more GD zig-zags. **Lesson:** Condition number directly controls GD convergence speed. Newton is invariant to conditioning.

Section C — OKS Connection

C1. The OKS navigation estimator's EKF state includes $(x, y, \theta, v, \omega)$. The position variables are in metres ($\sim 0$ to $100$), while the heading $\theta$ is in radians ($\sim 0$ to $6.28$).

Why might this cause numerical issues if you were solving an optimization problem with these variables?
How would you rescale? What scaling factors would you choose?
Why does the EKF avoid this issue (mostly)?

Answer

1. The Hessian will have entries of very different magnitudes — position-related entries could be $10^4 \times$ larger than heading-related entries. This leads to a poorly conditioned Hessian, causing slow convergence or inaccurate steps. 2. Scale all variables to $O(1)$: - Position: divide by typical range ($\sim 50$ m) → scaled position $\in [-1, 2]$ - Heading: already $O(1)$, no scaling needed - Velocity: divide by max velocity ($\sim 1$ m/s) → scaled $\in [0, 1]$ - Angular velocity: divide by max ($\sim 1$ rad/s) → already $O(1)$ 3. The EKF processes one measurement at a time, updating the state incrementally. It never solves a large system simultaneously where different-scale variables interact in the same Hessian. The covariance matrix $P$ naturally tracks the different scales. The issue mainly arises in batch optimization where all variables are solved at once.