Day 13: Iterative Methods — Conjugate Gradient

Phase I — Mathematical Foundations | Week 2 | 2.5 hours CG solves PD systems in n steps — but with preconditioning, it often needs far fewer.

Previous: Day 12: Conditioning & Function Landscapes
Next: Day 14: Week 2 Review
Week: Week 2 Overview
Curriculum: CURRICULUM.md

OKS Relevance

When solving the Schur complement system $Sx_c = g_c$ in BA, $S$ is dense and moderately sized (hundreds to thousands). Ceres uses CG with block-Jacobi preconditioning on $S$ (the ITERATIVE_SCHUR strategy). For large-scale SLAM, preconditioned CG on the full sparse system is sometimes preferred over direct factorization. CG's convergence rate depends directly on $\kappa(S)$ — everything from Day 12 feeds into this.

Theory (45 min)

13.1 Why Iterative?

Method	Cost per step	Total cost ($n \times n$ system)	Memory
Dense Cholesky	—	$O(n^3)$	$O(n^2)$
Sparse Cholesky	—	$O(n \cdot \text{nnz}^2)$	Fill-in dependent
CG	$O(\text{nnz})$	$O(k \cdot \text{nnz})$, $k \ll n$	$O(n)$

CG wins when: (1) $n$ is huge, (2) the matrix is sparse, (3) you only need an approximate solution, or (4) you can't afford the memory for factorization.

13.2 CG for Positive Definite Systems

Goal: Solve $Ax = b$ where $A$ is SPD ($A = A^T$, $x^TAx > 0$ for $x \neq 0$).

Key insight: Solving $Ax = b$ is equivalent to minimizing:

$$f(x) = \frac{1}{2}x^TAx - b^Tx$$

CG minimizes $f$ over successive Krylov subspaces $\mathcal{K}_k(A, r_0) = \text{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{k-1}r_0\}$.

13.3 The CG Algorithm

r₀ = b - Ax₀
p₀ = r₀

for k = 0, 1, 2, ...
    αₖ = (rₖᵀrₖ) / (pₖᵀApₖ)         # step size
    xₖ₊₁ = xₖ + αₖpₖ                  # update solution
    rₖ₊₁ = rₖ - αₖApₖ                 # update residual
    βₖ = (rₖ₊₁ᵀrₖ₊₁) / (rₖᵀrₖ)       # CG coefficient
    pₖ₊₁ = rₖ₊₁ + βₖpₖ                # update direction

Properties: - Exact solution in at most $n$ steps (in exact arithmetic) - Search directions $p_k$ are A-conjugate: $p_i^T A p_j = 0$ for $i \neq j$ - Residuals are orthogonal: $r_i^T r_j = 0$ for $i \neq j$ - Each step only needs one matrix-vector product $Ap_k$

13.4 Convergence Rate

After $k$ iterations:

$$\frac{\|x_k - x^*\|_A}{\|x_0 - x^*\|_A} \leq 2 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k$$

where $\kappa = \kappa(A) = \lambda_{\max}/\lambda_{\min}$.

Rule of thumb: To reduce error by factor $10^{-p}$, need $k \approx \frac{p}{2}\sqrt{\kappa}$ iterations.

$\kappa$	Iterations for $10^{-6}$ accuracy
$10$	~10
$10^2$	~30
$10^4$	~300
$10^6$	~3000

13.5 Preconditioned CG (PCG)

Replace $Ax = b$ with $M^{-1}Ax = M^{-1}b$ where $M \approx A$:

r₀ = b - Ax₀
z₀ = M⁻¹r₀
p₀ = z₀

for k = 0, 1, 2, ...
    αₖ = (rₖᵀzₖ) / (pₖᵀApₖ)
    xₖ₊₁ = xₖ + αₖpₖ
    rₖ₊₁ = rₖ - αₖApₖ
    zₖ₊₁ = M⁻¹rₖ₊₁                    # preconditioner solve
    βₖ = (rₖ₊₁ᵀzₖ₊₁) / (rₖᵀzₖ)
    pₖ₊₁ = zₖ₊₁ + βₖpₖ

Now convergence depends on $\kappa(M^{-1}A)$ instead of $\kappa(A)$.

13.6 CG as an Optimizer

CG is also used for unconstrained optimization (nonlinear CG): - Fletcher-Reeves: $\beta_k = \frac{\|g_{k+1}\|^2}{\|g_k\|^2}$ - Polak-Ribière: $\beta_k = \frac{g_{k+1}^T(g_{k+1} - g_k)}{\|g_k\|^2}$

These will appear in Week 3 (Day 20). For now, focus on CG for linear systems.

Implementation (60 min)

Exercise 1: CG from Scratch

# See: code/week02/iterative_methods.py

Implement the CG algorithm. Verify it converges in exactly $n$ steps for an $n \times n$ system.

Exercise 2: Convergence Monitoring

Track $\|r_k\|$ vs iteration. Show the dependence on $\kappa$.

Exercise 3: Preconditioned CG

Implement PCG with Jacobi preconditioning. Compare iteration counts.

Exercise 4: Sparse PD Benchmark

Create a large sparse PD system (1D Laplacian, $n = 10000$). Compare: (a) dense solve, (b) sparse direct solve, (c) CG, (d) PCG.

Practice (45 min)

Prove that the CG search directions are A-conjugate (i.e., $p_i^T A p_j = 0$).
Why does CG require $A$ to be SPD? What goes wrong if $A$ has a negative eigenvalue?
CG needs one matrix-vector product per iteration. For a sparse matrix with nnz nonzeros, what's the cost per iteration?
You're solving a $10^6 \times 10^6$ sparse system with $\kappa = 10^4$. Estimate the CG iteration count.
The Jacobi preconditioner is $M = \text{diag}(A)$. When is it effective? When does it fail?
Why is CG memory-efficient compared to GMRES for symmetric systems?
In Ceres, ITERATIVE_SCHUR uses CG on the Schur complement. Why not use CG on the full system?
If CG converges in 50 iterations for a $10000 \times 10000$ system, what does this tell you about the eigenvalue distribution?

Expert Challenges

🎯 Challenge 1: CG as polynomial optimization

**Problem:** Show that CG finds the best polynomial approximation $p_k(\lambda)$ of degree $k$ such that $\|x_k - x^*\|_A = \min_{p_k: p_k(0)=1} \max_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} |p_k(\lambda)| \cdot \|x_0 - x^*\|_A$. Why does this explain superlinear convergence when eigenvalues cluster? **Solution:** CG's error at step $k$ satisfies $\|e_k\|_A \leq \min_{p_k(0)=1} \max_i |p_k(\lambda_i)| \cdot \|e_0\|_A$ where the min is over degree-$k$ polynomials with $p(0) = 1$. If eigenvalues cluster into $m$ groups, a degree-$m$ polynomial can nearly vanish at all eigenvalues, giving fast convergence in $\sim m \ll n$ steps. This is why preconditioning that clusters eigenvalues (even without reducing $\kappa$ much) can dramatically accelerate CG.

🎯 Challenge 2: Truncated CG for trust regions

**Problem:** In trust-region methods (Week 3), we need to solve $H \Delta x = -g$ subject to $\|\Delta x\| \leq \Delta$. Implement Steihaug-Toint truncated CG: run CG but stop early if (a) the iterate reaches the trust region boundary, or (b) negative curvature is detected ($p_k^T H p_k \leq 0$). **Solution:** At each CG step, check: (1) if $p_k^T H p_k \leq 0$, go to the trust region boundary along $p_k$ (negative curvature = not PD in this direction). (2) if $\|x_{k+1}\| > \Delta$, find $\tau$ such that $\|x_k + \tau p_k\| = \Delta$ (quadratic in $\tau$, take positive root). The truncated CG-Steihaug method provides an approximate solution to the trust-region subproblem in at most $n$ CG steps, naturally handling indefinite Hessians.

🎯 Challenge 3: Block CG for multiple right-hand sides

**Problem:** In SLAM, you sometimes need to solve $AX = B$ where $B \in \mathbb{R}^{n \times s}$ ($s$ RHS vectors). Standard CG solves each column independently. Design Block CG that shares Krylov subspace information across columns. **Solution:** Block CG maintains $s$ search directions simultaneously and orthogonalizes across all of them. The key modification: $P_k \in \mathbb{R}^{n \times s}$, $R_k \in \mathbb{R}^{n \times s}$, and the step size becomes a matrix $\alpha_k = (P_k^TAP_k)^{-1}P_k^TR_k$. When the RHS vectors share structure (common in SLAM covariance recovery), Block CG converges in fewer matrix-vector products total than $s$ independent CG runs. Deflation of converged columns is important for efficiency.

Self-Assessment Checklist

[ ] I can implement CG from scratch and verify $n$-step convergence
[ ] I understand the convergence bound and its dependence on $\kappa$
[ ] I can implement and apply Jacobi preconditioning
[ ] I know when to choose CG over direct solvers

Key Takeaways

CG solves SPD systems with $O(\text{nnz})$ per iteration — unbeatable for large sparse systems.
Convergence depends on $\sqrt{\kappa}$ — preconditioning is essential for ill-conditioned problems.
Preconditioned CG replaces $\kappa(A)$ with $\kappa(M^{-1}A)$ — Jacobi is the simplest useful preconditioner.
Ceres uses PCG for the Schur complement in BA (ITERATIVE_SCHUR + CLUSTER_JACOBI).