Day 16: Line Search Methods

Phase II — Unconstrained Optimization | Week 3 | 2.5 hours Choosing the right step size is half the battle.

Previous: Day 15: Gradient Descent
Next: Day 17: Newton's Method
Week: Week 3 Overview
Curriculum: CURRICULUM.md

OKS Relevance

Every Ceres solver iteration uses a line search or trust region to pick the step size. The Wolfe conditions guarantee convergence for BFGS — which is why Ceres requires them internally when using the LINE_SEARCH minimizer. Understanding Armijo vs Wolfe helps you debug cases where the solver stalls: often a line-search failure is the root cause of "solver did not converge."

Theory (45 min)

16.1 Exact Line Search

Given a search direction $d_k$, the ideal step:

$$\alpha^* = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)$$

This is a 1D optimization — solvable for quadratics, but impractical in general.

16.2 Armijo Condition (Sufficient Decrease)

Accept $\alpha$ if:

$$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha \nabla f_k^T d_k$$

where $c_1 \in (0, 1)$, typically $c_1 = 10^{-4}$.

Interpretation: The function must decrease by at least a fraction of the predicted decrease from the linear model.

16.3 Curvature Condition

$$\nabla f(x_k + \alpha d_k)^T d_k \geq c_2 \nabla f_k^T d_k$$

where $c_2 \in (c_1, 1)$, typically $c_2 = 0.9$.

Interpretation: The gradient along $d_k$ must have flattened enough — prevents stopping too early.

16.4 Wolfe Conditions

Wolfe = Armijo + curvature:

$$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha \nabla f_k^T d_k$$ $$\nabla f(x_k + \alpha d_k)^T d_k \geq c_2 \nabla f_k^T d_k$$

Strong Wolfe = Armijo + strong curvature (absolute value):

$$|\nabla f(x_k + \alpha d_k)^T d_k| \leq c_2 |\nabla f_k^T d_k|$$

16.5 Backtracking Algorithm

α ← α₀  (e.g., 1.0)
while f(x + αd) > f(x) + c₁ α ∇f^T d:
    α ← ρ α     (ρ ∈ (0,1), e.g., 0.5)
return α

Simple, cheap, guaranteed to terminate for any descent direction.

16.6 Zoom Algorithm (Nocedal & Wright)

For Wolfe conditions — a more sophisticated two-phase algorithm: 1. Bracket: Find interval $[\alpha_{\text{lo}}, \alpha_{\text{hi}}]$ containing an acceptable step 2. Zoom: Bisect or interpolate within the bracket until Wolfe conditions hold

16.7 Why Wolfe Matters

Armijo alone is sufficient for GD convergence
Wolfe conditions are needed for BFGS convergence (secant condition requires curvature)
Strong Wolfe prevents pathological cases in conjugate gradient

Implementation (60 min)

Exercise 1: Backtracking Line Search

# See: code/week03/line_search.py

Implement backtracking with Armijo condition. Test parameter sensitivity ($c_1$, $\rho$, $\alpha_0$).

Exercise 2: Wolfe Line Search

Implement the zoom algorithm from Nocedal & Wright, Algorithm 3.5/3.6.

Exercise 3: Comparison

Compare fixed step vs Armijo vs Wolfe on Rosenbrock.

Exercise 4: Pathological Case

Show where Armijo alone gives bad step sizes (too short), but Wolfe corrects it.

Exercise 5: Visualization

For a 1D slice along the search direction: plot $\phi(\alpha) = f(x + \alpha d)$ with Armijo/Wolfe acceptance regions marked.

Practice (45 min)

Prove: Armijo + backtracking always terminates given a descent direction and $f$ bounded below.
Parameter sensitivity: What happens with $c_1 = 0.5$ (too strict) vs $c_1 = 10^{-8}$ (too loose)?
Show that $c_1 < c_2$ is required for the Wolfe conditions to be satisfiable.
For quadratic $f(x) = \frac{1}{2}x^TAx$, derive the exact line search step: $\alpha^* = \frac{g^Tg}{g^TAg}$.
Why does backtracking start with $\alpha = 1$ for Newton's method? (Because Newton's step is unit-length optimal.)
How many function evaluations does backtracking need per iteration? (1 + number of shrinks.)
Wolfe line search needs gradient evaluations — is it worth the cost?
Draw the 1D plot showing acceptable Armijo region, Wolfe region, and strong Wolfe region.

Expert Challenges

🎯 Challenge 1: Quadratic interpolation line search

**Problem:** Instead of halving $\alpha$, use the current function values to fit a quadratic and find its minimum. Implement this and compare convergence speed vs simple backtracking. **Solution:** Given $\phi(0) = f(x_k)$, $\phi'(0) = \nabla f_k^T d_k$, and $\phi(\alpha_0) = f(x_k + \alpha_0 d_k)$, fit quadratic $\phi(\alpha) \approx a\alpha^2 + b\alpha + c$. Then $\alpha_{\text{new}} = -b/(2a) = -\phi'(0) \alpha_0^2 / (2(\phi(\alpha_0) - \phi(0) - \phi'(0)\alpha_0))$. Safeguard: ensure $\alpha_{\text{new}} \in [0.1\alpha_0, 0.5\alpha_0]$. This typically needs fewer iterations than simple halving because it uses curvature information.

🎯 Challenge 2: Goldstein conditions

**Problem:** An alternative to Wolfe: the **Goldstein conditions** require $f(x + \alpha d)$ to lie between two linear bounds. State them and implement. When are they preferred over Wolfe? **Solution:** Goldstein conditions: $(1 - c) \alpha \nabla f^T d \leq f(x + \alpha d) - f(x) \leq c \alpha \nabla f^T d$ with $0 < c < 1/2$. The lower bound prevents $\alpha$ from being too small, the upper bound (Armijo) prevents too large. Preferred for methods that don't need curvature info (GD, trust region). Not suitable for BFGS/CG because they don't guarantee the curvature condition needed for the secant update.

🎯 Challenge 3: Line search failure detection

**Problem:** In production code (Ceres, scipy), what happens when line search fails? Implement a line search that gracefully handles: (a) $f$ returning NaN/Inf, (b) gradient returning NaN, (c) step size underflow ($\alpha < 10^{-16}$). Return appropriate diagnostics. **Solution:** Add guards: if $f(x + \alpha d)$ is NaN/Inf → shrink $\alpha$ aggressively ($\rho = 0.1$ instead of 0.5). If $\alpha < \alpha_{\min}$ → declare line search failure, return the current $x$ with a flag. If gradient is NaN → fall back to function-value-only line search (no curvature condition). Ceres does this internally — check `LineSearchMinimizer::DoLineSearch()` source. Scipy raises `LineSearchWarning`.

Self-Assessment Checklist

[ ] I can implement backtracking (Armijo) and Wolfe line searches
[ ] I understand why Wolfe conditions are needed for BFGS convergence
[ ] I know the zoom algorithm structure for strong Wolfe
[ ] I can visualize and interpret line search acceptance regions

Key Takeaways

Backtracking (Armijo) is the simplest line search — guaranteed to terminate, sufficient for GD.
Wolfe conditions add a curvature requirement — needed for BFGS and CG convergence guarantees.
Strong Wolfe = Armijo + $|g^Td| \leq c_2|g_0^Td|$ — most restrictive, best guarantees.
Zoom algorithm is the standard implementation for Wolfe — see Nocedal & Wright §3.5.