Day 17: Newton's Method
Phase II — Unconstrained Optimization | Week 3 | 2.5 hours
Quadratic convergence: the gold standard when you can afford the Hessian.
Navigation
OKS Relevance
Ceres uses Gauss-Newton (an approximation of Newton) as its default inner loop. Newton's method on the full problem $\min f(x)$ is the theoretical ideal — quadratic convergence means ~5 iterations vs ~10,000 for GD. The catch: you need the Hessian ($O(n^2)$ storage, $O(n^3)$ solve). For SLAM with 100,000 variables, that's impossible — which is why we'll learn sparse and approximate alternatives.
Theory (45 min)
17.1 The Newton Step
Minimize the second-order Taylor expansion at $x_k$:
$$m_k(\delta) = f_k + \nabla f_k^T \delta + \frac{1}{2} \delta^T H_k \delta$$
Setting $\nabla_\delta m_k = 0$:
$$H_k \delta = -\nabla f_k \quad \Rightarrow \quad x_{k+1} = x_k - H_k^{-1} \nabla f_k$$
17.2 Quadratic Convergence
Near a minimum $x^*$ where $H(x^*) \succ 0$:
$$\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2$$
Example: If $\|x_0 - x^*\| = 0.1$, then errors go: $0.1, 0.01, 0.0001, 10^{-8}, 10^{-16}$ — just 5 iterations to machine precision.
17.3 Derivation for Quadratics
For $f(x) = \frac{1}{2}x^TAx - b^Tx$: $\nabla f = Ax - b$, $H = A$.
Newton: $x_1 = x_0 - A^{-1}(Ax_0 - b) = A^{-1}b = x^*$.
Newton converges in exactly 1 step on any quadratic — because the Taylor model is exact.
17.4 When Newton Fails
- Hessian not PD: Step may ascend (maximize instead of minimize)
- Far from solution: Quadratic model is poor, step overshoots
- Cost: $O(n^2)$ to store $H$, $O(n^3)$ to solve $H\delta = -g$ — prohibitive for large $n$
- Saddle points: Newton converges to saddle points as readily as to minima
17.5 Modified Newton
When $H$ is not PD, add regularization:
$$(\bar{H} + \mu I)\delta = -\nabla f$$
Choose $\mu$ to make $\bar{H} + \mu I \succ 0$. This is a damped Newton step — blends between Newton (small $\mu$) and gradient descent (large $\mu$).
17.6 Newton with Line Search
To handle poor initial points:
$$x_{k+1} = x_k + \alpha_k \delta_k$$
where $\delta_k$ is the Newton direction and $\alpha_k$ satisfies Wolfe conditions. Ensures global convergence while preserving quadratic convergence near the solution.
Implementation (60 min)
Exercise 1: Pure Newton
# See: code/week03/newtons_method.py
Implement Newton's method with analytic Hessian.
Exercise 2: Rosenbrock Test
Test on Rosenbrock: count iterations (~10-20 vs GD's ~10,000). Show quadratic convergence by plotting $\log\|x_k - x^*\|$ vs iteration.
Exercise 3: Quadratic Convergence Verification
Plot $\log(\text{error}_k)$ vs $k$ — the slope should double each step (characteristic of order-2 convergence).
Exercise 4: Failure Case
Start far from minimum where Hessian is indefinite. Show pure Newton diverges, but damped Newton (with line search) works.
Exercise 5: Modified Newton
Implement $H + \mu I$ regularization. When $H$ has negative eigenvalues, shift them to be positive. Show it stabilizes the divergent case.
Practice (45 min)
- Derive: Newton's method on $f(x) = ax^2 + bx + c$ converges in 1 step. Show it.
- When does Newton fail? List 3 scenarios with concrete examples.
- For $f(x,y) = x^4 + y^4$, compute the Hessian at $(0,0)$. Why does Newton struggle there?
- Compare per-iteration cost: GD is $O(n)$, Newton is $O(n^3)$. At what $n$ does Newton become impractical?
- For Rosenbrock at $(-1, 1)$: compute $H$, check if PD, compute Newton step.
- What is the relationship between Newton's method and the inverse function theorem?
- If the Hessian is diagonal, what is the Newton step? (Element-wise $g_i / H_{ii}$.)
- Why does damped Newton ($\alpha < 1$) help? Sketch the 1D case.
Expert Challenges
🎯 Challenge 1: Newton's method on a neural network loss
**Problem:** Consider a 2-layer network with 10 parameters. Compute the full Hessian and run Newton. Compare with GD + Adam. Newton should converge in very few steps — but what's the catch for networks with 10M parameters?
**Solution:** For 10 parameters, Newton converges in ~5-10 iterations while Adam takes ~100s. But for $n = 10^7$: $H$ is $10^7 \times 10^7 = 10^{14}$ entries = 800 TB. Solving $H\delta = -g$ is $O(n^3) = O(10^{21})$ FLOPs. This is why deep learning uses first-order methods (SGD, Adam) — Newton is information-optimal but computationally infeasible at scale.
🎯 Challenge 2: Cubic regularization of Newton
**Problem:** Instead of the quadratic model, use $m(\delta) = f + g^T\delta + \frac{1}{2}\delta^T H\delta + \frac{\sigma}{3}\|\delta\|^3$. Show that this method has $O(1/k^2)$ global convergence rate (better than GD's $O(1/k)$) and naturally handles indefinite Hessians.
**Solution:** The cubic term penalizes large steps, so even if $H$ is indefinite, the model has a unique minimum. The subproblem reduces to $(H + \lambda I)\delta = -g$ where $\lambda = \sigma\|\delta\|$ — this is the Levenberg-Marquardt form! The $O(1/k^2)$ rate was proven by Nesterov & Polyak (2006). The method is "ARC" (Adaptive Regularization with Cubics). Key insight: the cubic model is a better local approximation than quadratic when far from the solution.
🎯 Challenge 3: Hessian-free Newton (Newton-CG)
**Problem:** Instead of forming $H$ explicitly, use CG to solve $H\delta = -g$ using only Hessian-vector products $Hv$. Implement Newton-CG where $Hv$ is computed via AD. Compare memory: full Newton $O(n^2)$ vs Newton-CG $O(n)$.
**Solution:** $Hv$ can be computed as $\nabla((\nabla f)^T v)$ — a gradient of a scalar, which AD handles in $O(n)$ time. Run CG for $k \ll n$ steps (truncated Newton). Early termination when negative curvature detected: $d^T H d < 0$ → use current CG direction as is. This scales Newton to $n = 10^6+$ variables. Libraries: `jax.jvp` for forward-mode Hv, `scipy.sparse.linalg.LinearOperator` + `cg` for the solve.
Self-Assessment Checklist
- [ ] I can implement Newton's method and verify quadratic convergence
- [ ] I know when Newton fails (indefinite $H$, far from solution, large $n$)
- [ ] I can implement modified Newton with $H + \mu I$ regularization
- [ ] I understand Newton with line search for global convergence
Key Takeaways
- Newton's step is $\delta = -H^{-1}g$ — solves the quadratic model exactly.
- Quadratic convergence $\|e_{k+1}\| \leq C\|e_k\|^2$ — doubles correct digits each step.
- Fails when $H \not\succ 0$ or far from solution — fix with regularization or line search.
- Cost $O(n^3)$ per step — prohibitive for large problems, motivating quasi-Newton (Day 19).