Goal: Understand the core algorithms that find minima of smooth functions without constraints — the building blocks for everything in chapters 03–07.
$$\min_{x \in \mathbb{R}^n} f(x)$$
No constraints on $x$. The function $f$ is assumed smooth (at least twice differentiable).
Strategy: Generate a sequence $x_0, x_1, x_2, \ldots$ that converges to a local minimizer $x^*$.
Every method follows this template:
x_{k+1} = x_k + α_k · d_k
where $d_k$ is the search direction and $\alpha_k$ is the step size (learning rate).
The simplest choice: move in the steepest descent direction.
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
def gradient_descent(f, grad_f, x0, alpha=0.01, tol=1e-6, max_iter=10000):
x = x0.copy()
for k in range(max_iter):
g = grad_f(x)
if np.linalg.norm(g) < tol:
break
x = x - alpha * g
return x, k
| Strategy | Formula | Pros | Cons |
|---|---|---|---|
| Fixed | $\alpha_k = \alpha$ (constant) | Simple | Must tune; too large → diverge, too small → crawl |
| Exact line search | $\alpha_k = \arg\min_\alpha f(x_k - \alpha g_k)$ | Optimal per step | Expensive (1D optimization each step) |
| Backtracking (Armijo) | Start large, shrink until sufficient decrease | Practical, cheap | Needs two parameters ($c$, $\rho$) |
| Barzilai-Borwein | $\alpha_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$ | Fast, adaptive | Non-monotone (can increase $f$) |
Accept step $\alpha_k$ if:
$$f(x_k - \alpha_k g_k) \leq f(x_k) - c \cdot \alpha_k \|g_k\|^2$$
where $c \in (0, 1)$ (typically $c = 10^{-4}$). Start with $\alpha = 1$ and shrink by factor $\rho \in (0,1)$ until satisfied.
def backtracking_line_search(f, x, g, alpha=1.0, c=1e-4, rho=0.5):
fx = f(x)
while f(x - alpha * g) > fx - c * alpha * np.dot(g, g):
alpha *= rho
return alpha
For a strongly convex function with condition number $\kappa = \lambda_{\max}(H) / \lambda_{\min}(H)$:
$$f(x_{k+1}) - f(x^*) \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^2 (f(x_k) - f(x^*))$$
OKS relevance: This is why pure gradient descent is never used in the navigation estimator. $\kappa$ for the sensorbar problem can be $>10^4$.
Use the second-order Taylor model to compute the step:
$$\Delta x = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k) = -H_k^{-1} g_k$$
def newtons_method(f, grad_f, hess_f, x0, tol=1e-8, max_iter=100):
x = x0.copy()
for k in range(max_iter):
g = grad_f(x)
H = hess_f(x)
if np.linalg.norm(g) < tol:
break
dx = np.linalg.solve(H, -g) # Never invert H explicitly!
x = x + dx
return x, k
| Property | Gradient Descent | Newton's Method |
|---|---|---|
| Step direction | $-g$ (steepest descent) | $-H^{-1}g$ (accounts for curvature) |
| Convergence | Linear: $O(\kappa \log(1/\epsilon))$ | Quadratic: $O(\log\log(1/\epsilon))$ |
| Per-step cost | $O(n)$ | $O(n^3)$ (Hessian solve) |
| Condition-dependent? | Yes (severely) | No (self-correcting) |
Quadratic convergence means the number of correct digits doubles each step. Once you're close, convergence is explosive: - Step $k$: error = $10^{-2}$ - Step $k+1$: error = $10^{-4}$ - Step $k+2$: error = $10^{-8}$
Solutions: Line search along Newton direction, or trust region (see §5).
Approximate the Hessian $H_k$ (or its inverse $B_k = H_k^{-1}$) using only gradient information. No second derivatives needed.
Given step $s_k = x_{k+1} - x_k$ and gradient change $y_k = g_{k+1} - g_k$:
$$B_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) B_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$
For large $n$: don't store the full $n \times n$ matrix $B_k$. Instead, store the last $m$ pairs $(s_k, y_k)$ (typically $m = 5$–$20$) and compute $B_k g_k$ on the fly.
minimize(method='L-BFGS-B')OKS relevance: When Ceres uses DENSE_QR or DENSE_NORMAL_CHOLESKY for small problems, it's effectively doing Newton. For large problems, it uses iterative solvers that share ideas with L-BFGS.
Instead of choosing a direction then a step size (line search), choose both simultaneously by solving:
$$\min_{\Delta x} \quad m_k(\Delta x) = f_k + g_k^T \Delta x + \frac{1}{2} \Delta x^T B_k \Delta x$$ $$\text{subject to} \quad \|\Delta x\| \leq \Delta_k$$
where $\Delta_k$ is the trust region radius.
1. Solve the trust-region subproblem for Δx
2. Compute actual vs predicted reduction:
ρ_k = (f(x_k) - f(x_k + Δx)) / (m_k(0) - m_k(Δx))
3. If ρ_k > threshold: accept step, possibly expand trust region
If ρ_k ≈ 0 or < 0: reject step, shrink trust region
4. Update x_k+1 and Δ_{k+1}
| Aspect | Line Search | Trust Region |
|---|---|---|
| Step direction | Fixed (Newton, GD) | Adapted to region |
| Step length | Varies along direction | Bounded by radius |
| Far from minimum | Can take bad steps | Naturally conservative |
| Hessian not PD | Needs modification | Handled by constraint |
The LM algorithm (Chapter 03) can be interpreted as a trust-region method where:
$$\Delta x = -(J^T J + \lambda I)^{-1} J^T r$$
This is exactly what Ceres does. The TrustRegionStrategy in Ceres toggles between LEVENBERG_MARQUARDT and DOGLEG.
Improve upon gradient descent by choosing directions that are conjugate with respect to $A$:
$$d_i^T A d_j = 0 \quad \text{for } i \neq j$$
For a quadratic $f(x) = \frac{1}{2}x^TAx - b^Tx$, CG solves $Ax = b$ in exactly $n$ steps (in exact arithmetic).
def conjugate_gradient(A, b, x0, tol=1e-8, max_iter=None):
x = x0.copy()
r = b - A @ x # residual
d = r.copy() # search direction
for k in range(max_iter or len(b)):
Ad = A @ d
alpha = r.dot(r) / d.dot(Ad)
x = x + alpha * d
r_new = r - alpha * Ad
if np.linalg.norm(r_new) < tol:
break
beta = r_new.dot(r_new) / r.dot(r)
d = r_new + beta * d
r = r_new
return x
ITERATIVE_SCHUROKS relevance: Ceres uses CG-based solvers (CGNR, ITERATIVE_SCHUR) for large SLAM-like problems where the Hessian is too large to factorize directly.
| Method | Direction | Step | Convergence | Per-step cost | When to use |
|---|---|---|---|---|---|
| Gradient descent | $-g$ | Line search | Linear | $O(n)$ | ML training, huge $n$ |
| Newton | $-H^{-1}g$ | Line search / TR | Quadratic | $O(n^3)$ | Small $n$, fast convergence needed |
| BFGS | $-B_k g$ | Line search | Superlinear | $O(n^2)$ | Medium $n$, no Hessian available |
| L-BFGS | $-B_k g$ (implicit) | Line search | Superlinear | $O(mn)$ | Large $n$, limited memory |
| Trust region (Newton) | TR subproblem | Radius | Quadratic | $O(n^3)$ | Robust, general purpose |
| Conjugate gradient | Conjugate dirs | Exact (quadratic) | $\leq n$ steps | $O(n)$ per step | Huge sparse systems |
| LM (next chapter) | $(J^TJ+\lambda I)^{-1}J^Tr$ | Trust region | Superlinear | $O(mn^2)$ | Least-squares (Ceres!) |
Is f(x) a sum of squares: Σ r_i(x)²?
├── YES → Use Gauss-Newton / LM (Chapter 03)
│ (exploits structure, much faster)
│
└── NO → Is n large (> 10,000)?
├── YES → L-BFGS or CG
│ (can't store Hessian)
│
└── NO → Do you have the Hessian?
├── YES → Newton + trust region
│ (quadratic convergence)
│
└── NO → BFGS
(builds Hessian approximation from gradients)
| Concept | Where it's used next |
|---|---|
| Line search (Armijo) | 03 (step acceptance in LM) |
| Newton step $-H^{-1}g$ | 03 (Gauss-Newton as approximate Newton) |
| Trust region | 03 (LM is a trust-region method), 06 (pose-graph) |
| Condition number → convergence | 07 (numerical methods) |
| CG solver | 06 (large pose-graph), 07 (sparse systems) |
| Quadratic model | 03 (the key approximation in Gauss-Newton) |