Day 6: Optimality Conditions
Phase I — Mathematical Foundations | Week 1 | 2.5 hours
Necessary conditions tell you "not here." Sufficient conditions tell you "definitely here."
Navigation
OKS Relevance
When Ceres or g2o reports "convergence," it means the optimality conditions are satisfied below a threshold. Understanding these conditions helps you diagnose: did the solver find a true minimum, or did it stop at a saddle point?
Theory (45 min)
6.1 First-Order Necessary Condition (FONC)
If $x^*$ is a local minimizer of $f$ and $f$ is differentiable:
$$\boxed{\nabla f(x^*) = 0}$$
Any minimizer must be a stationary point (zero gradient). But not every stationary point is a minimizer — it could be a maximum or saddle.
6.2 Second-Order Necessary Condition (SONC)
If $x^*$ is a local minimizer:
$$\nabla f(x^*) = 0 \quad \text{AND} \quad H(x^*) \succeq 0 \quad \text{(PSD)}$$
The Hessian can't have negative eigenvalues at a minimum (you'd be able to decrease $f$ along the corresponding eigenvector).
6.3 Second-Order Sufficient Condition (SOSC)
If $\nabla f(x^*) = 0$ and $H(x^*) \succ 0$ (PD), then $x^*$ is a strict local minimum.
| Condition |
$\nabla f = 0$ |
Hessian |
Conclusion |
| FONC (necessary) |
✓ |
— |
Must hold at any minimum |
| SONC (necessary) |
✓ |
PSD |
Must hold at any minimum |
| SOSC (sufficient) |
✓ |
PD |
Guarantees local minimum |
6.4 The Gap: Degenerate Cases
When $\nabla f = 0$ and $H$ is PSD (but not PD):
- $f(x) = x^4$ at $x = 0$: minimum despite $f''(0) = 0$
- $f(x) = x^3$ at $x = 0$: inflection point, NOT a minimum
Higher-order analysis is needed for degenerate cases.
6.5 Saddle Points
$x^*$ is a saddle point if $\nabla f(x^*) = 0$ and $H(x^*)$ is indefinite (has both positive and negative eigenvalues).
In high dimensions, saddle points vastly outnumber local minima. If $H$ has $n$ eigenvalues, there are $2^n$ sign combinations — most are saddle-type.
6.6 For Convex Functions
For convex $f$:
- FONC ($\nabla f = 0$) is both necessary AND sufficient for global optimality
- No need to check the Hessian
- This is the power of convexity
Implementation (60 min)
Exercise 1: Finding and Classifying Stationary Points
# See: code/week01/optimality_conditions.py
Find all stationary points of a 2D function, classify each.
Exercise 2: Saddle Point Detection
Identify and visualize saddle points — show descent directions (negative curvature eigenvectors).
Exercise 3: Optimality Diagnostics
Given a "solution" from an optimizer, verify the optimality conditions and report:
- Gradient norm (should be ~0)
- Hessian eigenvalues (should be ≥0)
- Condition number (large = poorly conditioned)
Exercise 4: Escape from Saddle Points
Demonstrate that adding noise to gradient descent helps escape saddle points.
Practice (45 min)
For each function, find all stationary points and classify them:
- $f(x) = x^3 - 3x$ → two stationary points, classify each
- $f(x,y) = x^2 + y^2 - 2x - 4y + 5$ → complete the square
- $f(x,y) = x^3 + y^3 - 3xy$ → three stationary points
- $f(x,y) = \sin(x)\sin(y)$ on $[0, 2\pi]^2$ → multiple, many saddles
- $f(x,y) = x^2 y^2$ → degenerate Hessian at origin
- $f(x) = x^4 - 2x^2 + 1$ → minima and maximum
- $f(x,y,z) = x^2 + y^2 - z^2$ → saddle at origin
- $f(x) = \|Ax - b\|^2$ where $A$ is rank-deficient → infinitely many minimizers
Expert Challenges
🎯 Challenge 1: Convergence criteria in Ceres
**Problem:** Ceres Solver reports convergence based on three thresholds: (1) gradient tolerance, (2) relative function change, (3) relative parameter change. Map each to the optimality conditions. Which is the most theoretically justified?
**Solution:** (1) Gradient tolerance checks FONC: $\|\nabla f\| < \epsilon_g$. This is the most theoretically justified — directly checks stationarity. (2) Function change $|f_{k} - f_{k-1}|/|f_k| < \epsilon_f$ checks if progress has stalled, but could trigger at a saddle. (3) Parameter change $\|\Delta x\| < \epsilon_x$ also checks stalling. In practice, (1) is primary but all three are needed — (2) and (3) catch slow convergence that (1) might miss due to numerical noise.
🎯 Challenge 2: Saddle points in high dimensions
**Problem:** For a random quadratic $f(x) = \frac{1}{2}x^T A x$ where $A$ has entries drawn from $\mathcal{N}(0,1)$, what fraction of eigenvalues are positive? What does this imply about random non-convex problems?
**Solution:** For a GOE (Gaussian Orthogonal Ensemble) random symmetric matrix, eigenvalues follow the Wigner semicircle law centered at 0. About half are positive, half negative. So a random quadratic in $\mathbb{R}^n$ is a saddle point with probability approaching 1 as $n \to \infty$ — there are $2^n$ possible sign patterns but only 1 gives PD (minimum). In neural networks ($n \sim 10^6$), local minima are exponentially rare compared to saddle points. This is why SGD with momentum works — it naturally escapes saddle points.
🎯 Challenge 3: When SOSC fails but it's still a minimum
**Problem:** Show that $f(x,y) = x^4 + y^4$ has a minimum at the origin, but SOSC fails. What additional analysis proves it?
**Solution:** $\nabla f(0,0) = 0$ ✓. $H(0,0) = \text{diag}(0, 0)$ — PSD but not PD, so SOSC doesn't apply. However, for all $(x,y) \neq (0,0)$: $f(x,y) = x^4 + y^4 > 0 = f(0,0)$. Direct verification shows it's a global minimum. The issue is that SOSC is sufficient but not necessary — the function is "flatter than quadratic" at the origin. Fourth-order analysis: the 4th-order terms $x^4 + y^4$ are positive definite, confirming the minimum.
Self-Assessment Checklist
- [ ] I can state FONC, SONC, and SOSC precisely
- [ ] I can find all stationary points and classify them (minimum, maximum, saddle)
- [ ] I understand the gap between necessary and sufficient conditions
- [ ] I can interpret solver convergence diagnostics in terms of optimality conditions
Key Takeaways
- FONC ($\nabla f = 0$): necessary at any minimum, but saddle points also satisfy it.
- SOSC ($\nabla f = 0$, $H \succ 0$): guarantees a local minimum. The gold standard.
- For convex functions, FONC alone is sufficient — the simplicity that makes optimization tractable.
- In practice, check gradient norm + Hessian eigenvalues to verify your solution.