01 — Mathematical Foundations for Optimization

Goal: Build the calculus and linear algebra vocabulary needed to read any optimizer's cost function, gradient, and convergence proof.

1. Vectors, Norms, and Inner Products

Vector Norms

A norm $\| \cdot \|$ measures the "size" of a vector.

Norm	Formula	Geometric meaning
$\ell_1$ (Manhattan)	$\\|x\\|_1 = \sum_i \\|x_i\\|$	Diamond in 2D
$\ell_2$ (Euclidean)	$\\|x\\|_2 = \sqrt{\sum_i x_i^2}$	Circle in 2D
$\ell_\infty$ (Chebyshev)	$\\|x\\|_\infty = \max_i \\|x_i\\|$	Square in 2D

Why it matters: The choice of norm changes what "closest" means. In robust estimation (Ceres), switching from $\ell_2$ to $\ell_1$-like norms (Huber/Cauchy loss) reduces sensitivity to outliers.

Inner Product and Angle

$$\langle x, y \rangle = x^T y = \|x\| \|y\| \cos\theta$$

Two vectors are orthogonal when $x^T y = 0$. Orthogonality shows up everywhere: - Gradient is orthogonal to level sets - Residual is orthogonal to column space in least-squares - Conjugate directions in CG solver

2. Gradient — The Direction of Steepest Ascent

For a scalar function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient is:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

Key properties: 1. $\nabla f(x)$ points in the direction of steepest ascent of $f$ 2. $-\nabla f(x)$ points in the direction of steepest descent — this is why gradient descent works 3. $\nabla f(x) = 0$ at any local minimum, maximum, or saddle point (stationary point)

Example: Quadratic Cost

$$f(x) = \frac{1}{2} x^T A x - b^T x$$

$$\nabla f(x) = Ax - b$$

Setting $\nabla f = 0$ gives $Ax = b$ — this is the normal equation for least-squares!

Jacobian (Vector-Valued Functions)

For $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J \in \mathbb{R}^{m \times n}$:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

In Ceres: CostFunction::Evaluate() returns both residuals $f(x)$ and Jacobians $J$.

3. Hessian — Curvature Information

The Hessian is the matrix of second derivatives:

$$H_{ij} = \nabla^2 f(x) = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Key properties: - $H$ is symmetric (assuming $f$ is twice continuously differentiable) - Eigenvalues of $H$ describe curvature in each direction - $H \succ 0$ (positive definite) ⟹ the point is a strict local minimum

Curvature Interpretation

Eigenvalues of $H$	Shape at the point	Optimization implication
All positive	Bowl (minimum)	Converged — stop here
All negative	Cap (maximum)	Ascent point — run away
Mixed signs	Saddle point	Not a minimum, keep going
Some near zero	Flat valley	Slow convergence, ill-conditioned

OKS relevance: When the sensorbar has too few measurements, the Hessian near the solution has near-zero eigenvalues → the solver oscillates or converges slowly.

4. Taylor Expansion — The Optimizer's Workhorse

Every optimization method is built on Taylor approximation:

First-order (linear model):

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x$$

This is what gradient descent uses.

Second-order (quadratic model):

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T H \Delta x$$

This is what Newton's method uses. Setting the gradient of this quadratic to zero:

$$H \Delta x = -\nabla f(x)$$

This is the Newton step: $\Delta x = -H^{-1} \nabla f(x)$

5. Convexity — The Property That Makes Optimization Easy

Definition

A function $f$ is convex if for all $x, y$ and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

Geometrically: the line segment between any two points on the graph lies above the function.

Equivalent Conditions

Condition	Requires
$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$	$f$ only
$f(y) \geq f(x) + \nabla f(x)^T (y-x)$	First derivative
$\nabla^2 f(x) \succeq 0$ (PSD Hessian)	Second derivative

Why Convexity Matters

For convex functions: - Every local minimum is a global minimum — no getting stuck in bad local optima - Gradient descent converges to the optimum (given appropriate step size) - Strong duality holds → can solve dual problem instead - Efficient algorithms exist with polynomial-time guarantees

Common Convex Functions

Function	Domain	Convex?	OKS use
$\\|x\\|_2^2$	$\mathbb{R}^n$	✅ Strongly convex	Least-squares cost
$\\|x\\|_1$	$\mathbb{R}^n$	✅ Convex (not strongly)	L1 regularization
$\max(x_1, \ldots, x_n)$	$\mathbb{R}^n$	✅ Convex	Minimax problems
$x \log x$	$\mathbb{R}_{++}$	✅ Convex	Entropy, KL divergence
$e^x$	$\mathbb{R}$	✅ Convex	—
$\sin(x)$	$\mathbb{R}$	❌ Non-convex	—

Composition Rules (Preserving Convexity)

Sum of convex functions → convex
$\max$ of convex functions → convex
Affine transformation: $f(Ax + b)$ is convex if $f$ is convex
Non-negative weighted sum: $\sum \alpha_i f_i(x)$, $\alpha_i \geq 0$ → convex

OKS relevance: The sum of squared residuals $\sum_i r_i(x)^2$ is convex in $x$ when each $r_i$ is affine. When $r_i$ is nonlinear (sensorbar model), the problem is non-convex — which is why Ceres needs good initial guesses and LM damping.

6. Optimality Conditions

Unconstrained

Condition	Statement	What it tells you
First-order necessary	$\nabla f(x^*) = 0$	Stationary point (could be min, max, or saddle)
Second-order necessary	$\nabla^2 f(x^*) \succeq 0$	Not a maximum (could still be saddle)
Second-order sufficient	$\nabla f(x^) = 0$ and $\nabla^2 f(x^) \succ 0$	Strict local minimum ✅

Constrained (Preview — detailed in 04)

For $\min f(x)$ subject to $g_i(x) \leq 0$ and $h_j(x) = 0$:

KKT conditions (first-order necessary, given constraint qualification): 1. Stationarity: $\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) + \sum_j \nu_j \nabla h_j(x^*) = 0$ 2. Primal feasibility: $g_i(x^*) \leq 0$, $h_j(x^*) = 0$ 3. Dual feasibility: $\lambda_i \geq 0$ 4. Complementary slackness: $\lambda_i g_i(x^*) = 0$

7. Positive Definite Matrices — The Optimizer's Best Friend

A symmetric matrix $A$ is positive definite ($A \succ 0$) if:

$$x^T A x > 0 \quad \forall x \neq 0$$

Equivalent conditions: - All eigenvalues are positive - All leading minors are positive (Sylvester's criterion) - $A = L L^T$ for some lower-triangular $L$ with positive diagonal (Cholesky decomposition)

Why this matters for optimization: - The Hessian $H \succ 0$ guarantees a local minimum - Newton step $\Delta x = -H^{-1} g$ is a descent direction when $H \succ 0$ - In Gauss-Newton, $J^T J$ is always PSD (positive semi-definite) — adding $\lambda I$ in LM makes it PD - Cholesky factorization is the fastest way to solve $H \Delta x = -g$ when $H \succ 0$

8. Matrix Calculus Cheat Sheet

These identities appear constantly in optimization derivations:

Expression	Result
$\nabla_x (a^T x)$	$a$
$\nabla_x (x^T A x)$	$(A + A^T) x$ (or $2Ax$ if $A$ symmetric)
$\nabla_x \\|Ax - b\\|^2$	$2A^T(Ax - b)$
$\nabla_x \text{tr}(AXB)$	$A^T B^T$
$\frac{\partial}{\partial X} \log \det(X)$	$X^{-T}$

The third one — $\nabla_x \|Ax - b\|^2 = 2A^T(Ax - b)$ — is the gradient of the least-squares cost. Setting it to zero gives the normal equations: $A^T A x = A^T b$.

Summary: What to Carry Forward

Concept	Where it's used next
Gradient = steepest ascent	02-unconstrained (gradient descent)
Hessian = curvature	02-unconstrained (Newton's method)
Taylor quadratic model	02-unconstrained (trust region), 03-least-squares (Gauss-Newton)
Convexity → global minimum	05-convex-optimization (all of it)
PD matrices → Cholesky	03-least-squares (solving normal equations), 07-numerical
Normal equations	03-least-squares (the starting point)
KKT conditions	04-constrained (the entire chapter)