Goal: Understand the numerical pitfalls that make textbook algorithms fail in practice — conditioning, convergence, sparsity, and the tricks that real solvers use.
IEEE 754 double precision: 64 bits → ~15–16 significant decimal digits.
$$\text{Machine epsilon:} \quad \epsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$$
Every floating-point operation introduces relative error $\leq \epsilon_{\text{mach}}$.
Subtracting nearly equal numbers amplifies relative error:
# BAD: catastrophic cancellation
a = 1.0000000000001
b = 1.0000000000000
diff = a - b # Only ~1 digit of accuracy left!
# BETTER: reformulate to avoid subtraction
# Use log1p(x) instead of log(1+x) for small x
# Use expm1(x) instead of exp(x)-1 for small x
|x_new - x_old| < tol fails near machine precisionFor a matrix $A$:
$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
$\kappa(A)$ measures how much the solution $x$ can change when $b$ is perturbed in $Ax = b$:
$$\frac{\|\Delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\Delta b\|}{\|b\|}$$
| $\kappa(A)$ | Interpretation | Digits lost |
|---|---|---|
| $10^0$ = 1 | Perfect conditioning | 0 |
| $10^3$ | Mild | ~3 |
| $10^6$ | Moderate | ~6 |
| $10^{10}$ | Severe | ~10 (only 5–6 digits reliable) |
| $10^{16}$ | Singular to working precision | All |
The Hessian condition number determines convergence: - Gradient descent: $O(\kappa)$ iterations - Conjugate gradient: $O(\sqrt{\kappa})$ iterations - Newton: independent of $\kappa$ (but solving $H\Delta x = -g$ is sensitive to $\kappa(H)$)
import numpy as np
A = np.array([[1, 1], [1, 1.0001]]) # Nearly singular
print(f"Condition number: {np.linalg.cond(A):.0f}") # ~20,000
# For the normal equations:
J = compute_jacobian(x)
H = J.T @ J
print(f"Hessian condition: {np.linalg.cond(H):.0f}")
# Note: cond(J^T J) = cond(J)^2 ← squaring makes things worse!
OKS relevance: The normal equations $J^TJ \Delta x = -J^Tr$ square the condition number. This is why Ceres prefers QR factorization (DENSE_QR) over normal equations for ill-conditioned problems.
Every iteration of Newton/GN/LM requires solving:
$$H \Delta x = -g$$
How you solve this determines practical performance.
| Method | Cost | Requirements | Stability |
|---|---|---|---|
| LU decomposition | $O(n^3)$ | General $H$ | Good with pivoting |
| Cholesky $H = LL^T$ | $O(n^3/3)$ | $H \succ 0$ (positive definite) | Excellent |
| QR of $J$ | $O(mn^2)$ | Overdetermined LS | Better than normal eqs |
| Sparse Cholesky | $O(\text{fill-in})$ | Sparse $H \succ 0$ | Excellent |
Is H sparse?
├── YES → How sparse?
│ ├── Very sparse (SLAM, large-scale) → Sparse Cholesky (CHOLMOD)
│ │ or Iterative (CG/LSQR)
│ │
│ └── Moderately sparse → Sparse direct (SuiteSparse)
│
└── NO (dense) → Is H positive definite?
├── YES → Cholesky (fastest)
└── NO → LU with pivoting, or add damping (LM)
For very large systems where even sparse Cholesky is too expensive:
| Method | Applies to | Per-iteration | Convergence |
|---|---|---|---|
| Conjugate Gradient | $H \succ 0$, symmetric | $O(\text{nnz}(H))$ | $O(\sqrt{\kappa})$ |
| LSQR | Rectangular $J$ | $O(\text{nnz}(J))$ | $O(\sqrt{\kappa})$ |
| GMRES | General non-symmetric | $O(\text{nnz}(H) \cdot k)$ | Depends |
Preconditioning dramatically speeds up iterative solvers by reducing the effective $\kappa$.
In SLAM/graph optimization, $H$ has a specific sparsity pattern determined by the graph structure.
Original H: After Cholesky L:
[X . . X] [X . . .]
[. X X .] [. X X .]
[. X X X] → [. X X X]
[X . X X] [X F X X] ← F = "fill-in" (new non-zero)
Fill-in = new non-zero entries created during factorization. More fill-in → more computation.
The ordering of variables dramatically affects fill-in.
| Ordering | Fill-in | Method |
|---|---|---|
| Natural (as given) | Can be terrible | Default |
| AMD (Approximate Min Degree) | Good general-purpose | SuiteSparse, Ceres default |
| METIS (nested dissection) | Best for very large | Graph partitioning |
| COLAMD (column AMD) | Good for non-square | Used with QR |
// In Ceres:
options.linear_solver_ordering_type = ceres::AMD; // default
options.linear_solver_ordering_type = ceres::NESDIS; // nested dissection
In BA/SLAM, variables split into: - Poses (few connections each) — "camera" variables - Landmarks (connected to many poses) — "point" variables
The landmark block of $H$ is block-diagonal → can be eliminated cheaply:
$$H_{pp} - H_{pl} H_{ll}^{-1} H_{lp}) \Delta x_p = -g_p + H_{pl} H_{ll}^{-1} g_l$$
This Schur complement system is much smaller and denser.
Ceres exploits this with ITERATIVE_SCHUR and SCHUR_JACOBI preconditioners.
| Rate | Definition | Example |
|---|---|---|
| Linear | $\|x_{k+1} - x^*\| \leq c \|x_k - x^*\|$, $c < 1$ | Gradient descent |
| Superlinear | $\|x_{k+1} - x^*\| / \|x_k - x^*\| \to 0$ | BFGS, LM |
| Quadratic | $\|x_{k+1} - x^*\| \leq c \|x_k - x^*\|^2$ | Newton |
What to monitor during optimization:
for k in range(max_iter):
# Compute step
...
# Track these quantities:
cost_k = 0.5 * np.sum(r**2)
grad_norm = np.linalg.norm(J.T @ r)
step_norm = np.linalg.norm(dx)
# Convergence criteria (Ceres uses all three):
# 1. Function tolerance: relative change in cost
if abs(cost_k - cost_prev) / cost_prev < func_tol:
break # Cost barely changing
# 2. Gradient tolerance: near-zero gradient
if grad_norm < grad_tol:
break # At a stationary point
# 3. Parameter tolerance: step is tiny
if step_norm < param_tol * np.linalg.norm(x):
break # Parameters barely moving
| Symptom | Cause | Fix |
|---|---|---|
| Cost oscillates | Step too large / trust region issue | Reduce initial trust region, use LM |
| Cost decreases but never converges | Very flat minimum | Relax function_tolerance |
| Cost stalls early | Stuck at saddle / local min | Better initialization, multiple starts |
| Very slow decrease | Ill-conditioning | Precondition, rescale variables |
| NaN/Inf in cost or gradient | Numerical issue in residual | Check residual function for edge cases |
If variables have very different magnitudes ($x_1 \approx 10^6$, $x_2 \approx 10^{-3}$), the Hessian is ill-conditioned.
Fix: Scale variables so they're all $O(1)$:
$$\tilde{x} = Dx \quad \text{where } D = \text{diag}(\text{typical values})$$
# Before optimization, scale:
x_scale = np.array([1e6, 1e-3, 1.0]) # typical magnitudes
x_scaled = x0 / x_scale
# Optimize in scaled space
result = optimize(f_scaled, x_scaled)
# Unscale
x_final = result * x_scale
// Ceres can auto-scale the Jacobian columns:
options.jacobi_scaling = true; // Normalizes each column of J
This is equivalent to a diagonal preconditioner on the normal equations.
For iterative solving of $H\Delta x = -g$:
Solve $M^{-1}H \Delta x = -M^{-1}g$ where $M \approx H$ but cheap to invert.
| Preconditioner | Cost | Quality |
|---|---|---|
| Jacobi (diagonal of $H$) | $O(n)$ | Weak but cheap |
| Block Jacobi | $O(nb^3)$ ($b$ = block size) | Medium |
| Incomplete Cholesky | $O(\text{nnz})$ | Good |
| Schur-Jacobi (for BA) | Problem-dependent | Excellent for SLAM |
A dual number: $a + b\epsilon$ where $\epsilon^2 = 0$.
For $n$ parameters, Ceres uses $\text{Jet}
// Ceres Jet: value + N partial derivatives
template <typename T, int N>
struct Jet {
T a; // function value
T v[N]; // partial derivatives ∂f/∂x_i
};
// Arithmetic automatically propagates derivatives:
// (a + v·ε) * (b + w·ε) = ab + (aw + bv)·ε
Cost: Each operation costs ~$(N+1)$ floating-point ops instead of 1.
For a cost function with $n$ parameters and $m$ residuals: - Forward AD: $O(mn)$ — evaluate $m$ residuals, each tracking $n$ derivatives - This is the same complexity as analytic Jacobians
For $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output): - Forward mode: $n$ passes to get full gradient - Reverse mode: 1 pass (backpropagation!)
For $f : \mathbb{R}^n \to \mathbb{R}^m$: - Forward is better when $n < m$ (few params, many residuals) - Reverse is better when $n > m$ (many params, few outputs)
Ceres uses forward mode because NLLS has $m \gg n$ (many residuals, few parameters per block).
When the residual function is a black box (Fortran code, hardware-in-the-loop):
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$
// When you can't template the cost function:
ceres::CostFunction* cost =
new ceres::NumericDiffCostFunction<MyCostFunctor,
ceres::CENTRAL, // FORWARD, CENTRAL, or RIDDERS
1, // num residuals
3>( // num params
new MyCostFunctor());
Ridders' method is the most accurate (extrapolates multiple step sizes) but 4× more expensive than central.
| Pitfall | Where | Symptom | Fix |
|---|---|---|---|
| $\tan(\theta)$ at $\pm\pi/2$ | Sensorbar model | NaN in residual | Use $\sin/\cos$ parameterization |
| Division by small distance | Distance-based costs | Huge gradients | Add epsilon: $d + \epsilon$ |
| Angle wrapping | Pose error computation | 2π jumps in residual | Use atan2, normalize to $[-\pi, \pi]$ |
| Quaternion normalization | 3D rotation | Drift off unit sphere | Use manifold, or renormalize each step |
| Covariance matrix inversion | Information matrix from EKF | Singular $P$ → crash | Use pseudo-inverse, or regularize $P + \epsilon I$ |
| Log of negative number | In robust loss computation | NaN propagation | Guard with max(val, epsilon) |
PROBLEM TYPE?
│
├── Nonlinear Least Squares (sum of squares)
│ ├── n < 100, dense → Ceres DENSE_QR
│ ├── n < 1000, dense → Ceres DENSE_NORMAL_CHOLESKY
│ ├── Large, sparse (not SLAM) → Ceres SPARSE_NORMAL_CHOLESKY
│ └── SLAM / BA structure → Ceres ITERATIVE_SCHUR or g2o/GTSAM
│
├── Convex (LP, QP, SOCP, SDP)
│ ├── LP → HiGHS, Gurobi
│ ├── QP → OSQP (MPC), Gurobi
│ ├── SOCP → ECOS
│ └── SDP → SCS, MOSEK
│
├── General NLP (nonlinear + constraints)
│ ├── Medium → SQP (SNOPT, fmincon)
│ └── Large → Interior-point (IPOPT + CasADi)
│
└── Real-time / embedded
├── QP for MPC → qpOASES, OSQP, FORCES
└── Simple NLLS → hand-coded LM