Day 10: Numerical Differentiation Deep Dive

Phase I — Mathematical Foundations | Week 2 | 2.5 hours Trust but verify: every hand-derived Jacobian should be checked numerically.

Previous: Day 9: Sparse Matrices & Schur Complement
Next: Day 11: Automatic Differentiation
Week: Week 2 Overview
Curriculum: CURRICULUM.md

OKS Relevance

Ceres provides NumericDiffCostFunction as a fallback when analytic or auto-diff Jacobians aren't available. Understanding the accuracy limits of numerical differentiation tells you why Ceres auto-diff is almost always preferred, and why the numerical Jacobian checker is indispensable for debugging.

Theory (45 min)

10.1 Forward Difference

$$f'(x) \approx \frac{f(x+h) - f(x)}{h} + O(h)$$

First-order accurate. Error has two components: - Truncation error: $O(h)$ — decreases as $h \to 0$ - Rounding error: $O(\epsilon_{\text{mach}} / h)$ — increases as $h \to 0$

10.2 Central Difference

$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h} + O(h^2)$$

Second-order accurate. Two function evaluations per component.

10.3 Complex-Step Derivative

$$f'(x) = \frac{\text{Im}(f(x + ih))}{h} + O(h^2)$$

No cancellation error! Works because $f(x+ih) = f(x) + ihf'(x) - \frac{h^2}{2}f''(x) + \ldots$

The imaginary part gives $hf'(x) + O(h^3)$. Since there's no subtraction of nearly-equal numbers, $h$ can be tiny ($10^{-30}$) without rounding problems.

Limitation: Requires $f$ to be implementable with complex arithmetic (no abs, max, boolean branches on values).

10.4 Optimal Step Size

Method	Optimal $h$	Achievable error
Forward	$\sqrt{\epsilon_{\text{mach}}} \approx 10^{-8}$	$\sqrt{\epsilon} \approx 10^{-8}$
Central	$\epsilon^{1/3} \approx 10^{-5.3}$	$\epsilon^{2/3} \approx 10^{-10.7}$
Complex-step	As small as you want ($10^{-30}$)	Machine precision $\approx 10^{-16}$

10.5 Jacobian Computation

For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J \in \mathbb{R}^{m \times n}$ requires: - Forward/central: $n$ (or $2n$) function evaluations (perturb each input dimension) - Cost: $O(mn)$ per function evaluation → total $O(mn^2)$

This is why AD (Day 11) is preferred for large $n$.

Implementation (60 min)

Exercise 1: Error vs Step Size

# See: code/week02/numerical_differentiation.py

Plot error vs $h$ for forward, central, and complex-step on $f(x) = \sin(x)$. Show the V-shaped error curve.

Exercise 2: Find Optimal $h$ Experimentally

For each method, sweep $h$ from $10^{-1}$ to $10^{-16}$ and find the $h$ that minimizes error.

Exercise 3: Complex-Step Jacobian

Implement a complex-step Jacobian for vector functions $f: \mathbb{R}^n \to \mathbb{R}^m$.

Exercise 4: When Rounding Dominates

Test all three methods on $f(x) = x + 10^{-12}$ near $x=1$. The gradient is 1, but numerical methods struggle when $f(x)$ values are very close.

Practice (45 min)

Derive the truncation error for the central difference formula using Taylor expansion.
Why can't you use $h = 10^{-100}$ with forward differences but you can with complex-step?
What is the cost of computing a numerical Jacobian for $f: \mathbb{R}^{1000} \to \mathbb{R}^{50}$ using central differences?
Ceres NumericDiffCostFunction uses central differences by default. What step size does it use?
You derived a Jacobian by hand and want to verify it. Write the pseudocode for a gradient checker.
Why does the complex-step method fail on $f(x) = |x|$?
For a function that takes 1 second to evaluate with 500 parameters, how long would a numerical Jacobian take?
Higher-order finite differences: derive the 4th-order central difference formula. What's its optimal $h$?

Expert Challenges

🎯 Challenge 1: Automatic step size selection

**Problem:** Implement a gradient checker that automatically selects the optimal step size $h$ for each component by monitoring the error vs $h$ curve. Start with $h = 10^{-1}$ and halve until the error starts increasing (rounding regime). **Solution:** For each component $i$: compute forward differences with $h, h/2, h/4, \ldots$. The sequence of derivative estimates converges, then diverges as rounding takes over. The optimal $h$ is where the Richardson extrapolation error estimate $|D(h) - D(h/2)|$ is minimized. In practice, starting with $h = |x_i| \cdot 10^{-4}$ (or $10^{-4}$ if $x_i = 0$) and using central differences works well. Ceres uses `epsilon = sqrt(machine_epsilon) * max(abs(x_i), 1)`.

🎯 Challenge 2: Second derivatives via finite differences

**Problem:** Compute the Hessian numerically. (a) Use the formula $H_{ij} \approx \frac{f(x+h e_i+h e_j)-f(x+h e_i)-f(x+h e_j)+f(x)}{h^2}$. (b) What is its error order? (c) How many function evaluations for the full Hessian? (d) Can you use the complex-step trick? **Solution:** (a) This is the second-order forward difference for mixed partials. (b) Error is $O(h)$ — first-order only. For $O(h^2)$, use the central formula: $\frac{f(x+h e_i+h e_j)-f(x+h e_i-h e_j)-f(x-h e_i+h e_j)+f(x-h e_i-h e_j)}{4h^2}$. (c) Forward: $n(n+1)/2 + 1$ evaluations. Central: $2n(n+1)$ evaluations. (d) Complex-step can compute the Hessian using a "hyper-dual" number approach or by applying complex-step to the gradient (which itself can be computed analytically).

🎯 Challenge 3: Verifying a Ceres cost function Jacobian

**Problem:** You've written a Ceres cost function with an analytic Jacobian. Before deploying, you want to verify it. Design a test that: (1) evaluates the analytic Jacobian, (2) computes a numerical Jacobian, (3) reports the maximum relative error, (4) flags components where the error exceeds a threshold. **Solution:** In Ceres, use `ceres::GradientChecker` which does exactly this. In Python, the pattern is: for each parameter block, perturb each element by $h = \sqrt{\epsilon} \cdot \max(|x_i|, 1)$, compute central differences, compare to the analytic Jacobian. Report per-element relative errors $|J_a - J_n| / \max(|J_a|, |J_n|, 1)$. Flag if any exceed $10^{-4}$ (for central differences). This is the single most important debugging tool for optimization — a wrong Jacobian is the #1 cause of solver failure.

Self-Assessment Checklist

[ ] I can implement forward, central, and complex-step derivatives
[ ] I understand the truncation vs rounding tradeoff and optimal $h$
[ ] I can implement a gradient checker for vector functions
[ ] I know when numerical differentiation fails and why AD is preferred

Key Takeaways

Central difference ($O(h^2)$) is the workhorse — used by Ceres for gradient checking.
Complex-step eliminates cancellation error — use it when the function supports complex inputs.
Optimal $h$ exists because of the truncation-rounding tradeoff — $10^{-8}$ for forward, $10^{-5}$ for central.
Always verify analytic Jacobians numerically — this catches more bugs than any other practice.