Day 12: Function Landscapes & Conditioning

Phase I — Mathematical Foundations | Week 2 | 2.5 hours A well-conditioned problem is half-solved; an ill-conditioned one fights you at every step.

Previous: Day 11: Automatic Differentiation
Next: Day 13: Iterative Methods (CG)
Week: Week 2 Overview
Curriculum: CURRICULUM.md

OKS Relevance

When OKS SLAM's Hessian has eigenvalues spanning $10^6$, gradient descent crawls along narrow valleys while overshooting across ridges. The condition number $\kappa(H)$ quantifies this. Preconditioning and variable scaling transform the problem into a rounder landscape where solvers converge fast. Ceres's JACOBI scaling and CLUSTER_JACOBI preconditioner exist for exactly this reason.

Theory (45 min)

12.1 Condition Number

For a symmetric positive definite matrix $H$:

$$\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}}$$

Meaning: A perturbation of $\delta$ in the input (RHS) causes up to $\kappa \cdot \delta$ perturbation in the solution.

$\kappa(H)$	Interpretation
$\approx 1$	Perfectly conditioned (circular contours)
$10^3$	Moderately ill-conditioned
$10^6$	Ill-conditioned (common in SLAM)
$> 10^{15}$	Effectively singular ($\approx \frac{1}{\epsilon_{\text{mach}}}$)

12.2 Geometric View: Contour Plots

For $f(x) = \frac{1}{2} x^T H x$: - $\kappa = 1$: circular contours → all directions have equal curvature → gradient points to minimum - $\kappa \gg 1$: highly elliptical contours → gradient zig-zags → slow convergence

Gradient descent iterations ∝ $\kappa$ for quadratic problems.

12.3 Sources of Ill-Conditioning in Robotics

Scale mismatch: Translation parameters (meters) vs rotation parameters (radians) in the same system
Gauge freedom: Absolute position is unobservable → Hessian is rank-deficient
Poorly excited directions: Some parameters barely affect the cost
Distant landmarks: Range observations have $\sim 1/r^2$ information

12.4 Preconditioning

Transform $Hx = g$ into $M^{-1}Hx = M^{-1}g$ where $M \approx H$.

$$\kappa(M^{-1}H) \ll \kappa(H)$$

Common preconditioners: | Preconditioner | $M$ | Cost | Effectiveness | |---------------|-----|------|---------------| | Jacobi | $\text{diag}(H)$ | $O(n)$ | Moderate | | Block Jacobi | Block-diagonal of $H$ | $O(n \cdot b^2)$ | Good for block-structured problems | | Incomplete Cholesky | Sparse approximate $LL^T$ | $O(n \cdot \text{nnz})$ | Very good | | Exact | $H$ itself | $O(n^3)$ | Perfect (but defeats the purpose) |

12.5 Variable Scaling

Before optimizing, scale variables so all have similar magnitude:

$$\tilde{x}_i = x_i / s_i$$

where $s_i$ is a characteristic scale. This is equivalent to preconditioning with $M = \text{diag}(s_1^2, \ldots, s_n^2)$.

Ceres: Uses ParameterBlockLocalParameterization and ScaledLoss to handle scaling. The JACOBI scaling option divides each column of the Jacobian by its norm.

12.6 When Conditioning Goes Wrong

An ill-conditioned Hessian means: - Direct solvers lose digits of precision: $\log_{10}(\text{digits lost}) \approx \log_{10}(\kappa)$ - Iterative solvers converge slowly: CG needs $O(\sqrt{\kappa})$ iterations - Gradient descent needs $O(\kappa)$ iterations - The solver may report convergence but the solution is inaccurate

Implementation (60 min)

Exercise 1: Condition Number Visualization

# See: code/week02/conditioning.py

Create quadratic functions with $\kappa = 1, 10, 100, 1000$. Plot contours and gradient descent paths.

Exercise 2: Digits Lost

Solve $Hx = g$ for matrices with various $\kappa$. Measure actual digits of precision in the solution.

Exercise 3: Preconditioning

Implement Jacobi preconditioning. Show it reduces the effective condition number and speeds up CG convergence.

Exercise 4: Variable Scaling

Take a SLAM-like problem with mixed translation/rotation parameters. Show that scaling improves convergence.

Practice (45 min)

A $2 \times 2$ matrix has eigenvalues 1 and 1000. Draw its contour plot and the gradient descent path from $(1, 1)$.
You solve $Ax = b$ and get 8 correct digits. The true condition number is $10^8$. How many digits could be correct?
Why does adding a regularizer $\lambda I$ to $H$ improve conditioning? What does it cost?
In SLAM, translation parameters are in meters and rotation in radians. If a robot moves 100m with 0.01 rad rotations, what's the rough condition number due to scale mismatch?
Explain why the Gauss-Newton Hessian $J^T J$ can be ill-conditioned even if $J$ is well-conditioned.
CG converges in $O(\sqrt{\kappa})$ iterations. If $\kappa = 10^6$, how many iterations? After Jacobi preconditioning reduces it to $10^3$?
Why is incomplete Cholesky a good preconditioner for sparse systems?
What's the condition number of a diagonal matrix? Of the identity matrix?

Expert Challenges

🎯 Challenge 1: Automatic scaling for SLAM

**Problem:** Design an automatic variable-scaling scheme for a pose-graph SLAM problem. Given the Jacobian structure (some columns correspond to x/y translation, some to θ rotation), propose and implement a scaling that brings all columns to similar norm. **Solution:** Column scaling: $\tilde{J}_{\cdot,j} = J_{\cdot,j} / \|J_{\cdot,j}\|$. This is what Ceres `JACOBI` scaling does. For SLAM: translation columns have norm $\propto 1/\sigma_t$ (measurement noise), rotation columns have norm $\propto r/\sigma_\theta$ where $r$ is the range. If $\sigma_t = 0.01$m and $\sigma_\theta = 0.001$rad at $r = 10$m, the norms differ by $\sim 100\times$. After column scaling, $\kappa(J^TJ)$ drops dramatically. Caveat: the scaled solution $\tilde{x}$ must be unscaled: $x_j = \tilde{x}_j / s_j$.

🎯 Challenge 2: Condition number of the normal equations

**Problem:** Show that $\kappa(J^T J) = \kappa(J)^2$. Why does this mean you should avoid forming the normal equations explicitly when the Jacobian is moderately ill-conditioned? **Solution:** $J^T J$ has singular values $\sigma_i^2$ where $\sigma_i$ are the singular values of $J$. So $\kappa(J^TJ) = \sigma_{\max}^2/\sigma_{\min}^2 = \kappa(J)^2$. If $\kappa(J) = 10^4$ (fine), then $\kappa(J^TJ) = 10^8$ (losing 8 digits). Alternatives: (1) QR factorization of $J$ directly ($O(mn^2)$ but numerically stable), (2) use the augmented system $\begin{pmatrix}I & J \\ J^T & 0\end{pmatrix}\begin{pmatrix}r\\x\end{pmatrix}=\begin{pmatrix}b\\0\end{pmatrix}$ which has $\kappa = \kappa(J)$. In practice, for SLAM's sparse $J^TJ$ the sparsity advantage of normal equations usually outweighs the conditioning cost, especially with LM regularization $J^TJ + \lambda I$.

🎯 Challenge 3: Preconditioner design for BA

**Problem:** In bundle adjustment, the Schur complement $S$ is a dense camera-only matrix. Design a good preconditioner for $S$. Hint: consider the block structure where each camera contributes a $6 \times 6$ diagonal block. **Solution:** **Block Jacobi preconditioner:** $M = \text{blockdiag}(S_{11}, S_{22}, \ldots, S_{kk})$ where $S_{ii}$ is the $6 \times 6$ block corresponding to camera $i$. This is cheap to compute ($k$ inversions of $6 \times 6$ matrices) and captures the dominant diagonal structure. Ceres calls this `CLUSTER_JACOBI`. For even better preconditioning, cluster nearby cameras and use the block for each cluster (Ceres `CLUSTER_TRIDIAGONAL`). The ideal preconditioner would be $S$ itself, but that defeats the purpose of using CG on $S$.

Self-Assessment Checklist

[ ] I can compute and interpret the condition number of a matrix
[ ] I understand the geometric effect of ill-conditioning on optimization
[ ] I can implement Jacobi preconditioning and measure its effect
[ ] I know the sources of ill-conditioning in robotics problems

Key Takeaways

Condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ determines how hard a linear system (and the associated optimization) is to solve.
Ill-conditioning makes gradient-based methods slow and solutions inaccurate.
Preconditioning transforms the problem to reduce $\kappa$ — Jacobi (diagonal) is the simplest and often sufficient.
Variable scaling is the easiest fix — ensure all parameters have similar magnitude before optimizing.