Matrix Essentials for Optimization

A comprehensive reference — matrix types, criteria, decompositions, and their role in optimization.

This guide consolidates and extends the matrix material from Weeks 1–2 into a single, deep reference. Use it as a companion to the daily lessons or as a standalone study guide.

Prerequisites

1. Matrix Taxonomy

Not all matrices are created equal. Recognizing the type of matrix you're working with immediately tells you which algorithms, decompositions, and guarantees apply.

1.1 Structural Classification

Type	Definition	Key Property	Example in Robotics
Symmetric	$A = A^T$	Real eigenvalues, orthogonal eigenvectors	Hessian $\nabla^2 f$, covariance $\Sigma$
Skew-symmetric	$A = -A^T$	Eigenvalues are purely imaginary	Cross-product matrix $[\omega]_\times$
Orthogonal	$A^T A = I$	Preserves norms and angles, $\det = \pm 1$	Rotation matrices $R \in SO(3)$
Diagonal	$a_{ij} = 0$ for $i \neq j$	Eigenvalues = diagonal entries	Scaling, singular value matrix $\Sigma$
Triangular	$a_{ij} = 0$ above (or below) diagonal	$O(n^2)$ solve via substitution	Cholesky factor $L$
Band	$a_{ij} = 0$ for $\|i-j\| > k$	$O(nk^2)$ factorization	Tridiagonal systems in 1D optimization
Sparse	Most entries are zero	Specialized storage and algorithms	SLAM Jacobians, graph Laplacians
Block	Partitioned into sub-matrices	Exploit structure for efficiency	Bundle adjustment Hessian

1.2 Spectral Classification

Type	Eigenvalue Condition	Geometric Shape	Optimization Meaning
Positive definite (PD)	All $\lambda_i > 0$	Ellipsoidal bowl	Unique minimum exists
Positive semidefinite (PSD)	All $\lambda_i \geq 0$	Bowl with flat valley	Minimum exists but may not be unique
Indefinite	Mixed $\lambda_i$ signs	Saddle surface	Saddle point — not a minimum
Negative definite (ND)	All $\lambda_i < 0$	Inverted bowl	Maximum (minimize $-f$)
Negative semidefinite (NSD)	All $\lambda_i \leq 0$	Inverted bowl + flat	Maximum with degeneracy

1.3 The Definiteness Decision Tree

Is A symmetric?
├── No → Not applicable (definiteness is defined for symmetric matrices)
└── Yes → Compute eigenvalues
    ├── All λ > 0 → Positive Definite
    ├── All λ ≥ 0, some = 0 → Positive Semidefinite
    ├── All λ < 0 → Negative Definite
    ├── All λ ≤ 0, some = 0 → Negative Semidefinite
    └── Mixed signs → Indefinite

2. Positive Definiteness — The Central Criterion

In optimization, PD matrices appear everywhere: the Hessian at a minimum is PD, information matrices in SLAM are PD, and the convergence rate of most solvers depends on the eigenvalue spectrum of a PD matrix.

2.1 Six Tests for Positive Definiteness

Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$:

#	Test	Computational Cost	Notes
1	Eigenvalue test: All $\lambda_i > 0$	$O(n^3)$	Definitive but expensive
2	Cholesky test: $A = LL^T$ succeeds	$O(n^3/3)$	Fastest in practice, also gives solver
3	Sylvester's criterion: All leading principal minors $> 0$	$O(n^4)$ naive	Good for small matrices and proofs
4	Quadratic form: $x^TAx > 0$ for all $x \neq 0$	Not computational	The definition; used in proofs
5	Pivot test: All pivots in Gaussian elimination are positive	$O(n^3/3)$	Equivalent to Cholesky
6	Diagonal dominance (sufficient, not necessary): $a_{ii} > \sum_{j \neq i} \|a_{ij}\|$	$O(n^2)$	Quick sufficient check

Practical recommendation: Try Cholesky (#2). If it succeeds, the matrix is PD. If it fails (negative value under square root), the matrix is not PD. This is also the fastest method.

2.2 Sylvester's Criterion — Worked Example

A matrix is PD if and only if all leading principal minors are positive.

For $A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$:

$\Delta_1 = a_{11} > 0$
$\Delta_2 = a_{11}a_{22} - a_{12}a_{21} > 0$
$\Delta_3 = \det(A) > 0$

Example: $A = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 5 & 3 \\ 1 & 3 & 6 \end{pmatrix}$

$\Delta_1 = 4 > 0$ ✓
$\Delta_2 = 4 \cdot 5 - 2 \cdot 2 = 16 > 0$ ✓
$\Delta_3 = 4(30 - 9) - 2(12 - 3) + 1(6 - 5) = 84 - 18 + 1 = 67 > 0$ ✓

Therefore $A$ is positive definite.

2.3 The PSD Cone and Matrix Inequalities

The set of all $n \times n$ PSD matrices forms a convex cone $\mathcal{S}_+^n$:

If $A, B \succeq 0$ then $\alpha A + \beta B \succeq 0$ for $\alpha, \beta \geq 0$ (closure under conic combinations)
$0 \in \mathcal{S}_+^n$ (contains the origin)

Löwner (matrix) ordering: $A \succeq B$ means $A - B$ is PSD. This defines a partial order on symmetric matrices.

Key inequalities: - $A \succeq 0 \iff x^TAx \geq 0$ for all $x$ - $A \succeq B$ and $B \succeq C \implies A \succeq C$ (transitivity) - $A \succeq 0 \implies \text{tr}(A) \geq 0$ and $\det(A) \geq 0$ - If $A \succ 0$ then $A^{-1} \succ 0$

Optimization relevance: Semidefinite programming (SDP) optimizes over the PSD cone. The constraint "$H \succeq 0$" is convex, enabling efficient algorithms.

2.4 Cholesky Factorization — How It Works

Cholesky decomposes a PD matrix into $A = LL^T$ where $L$ is lower triangular with positive diagonal entries. The formulas follow directly from matching entries:

$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right) \quad (i > j)$$

Worked example: For $A = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 5 & 3 \\ 1 & 3 & 6 \end{pmatrix}$:

$L_{11} = \sqrt{4} = 2$
$L_{21} = 2/2 = 1$, $L_{31} = 1/2 = 0.5$
$L_{22} = \sqrt{5 - 1^2} = \sqrt{4} = 2$
$L_{32} = (3 - 0.5 \cdot 1)/2 = 2.5/2 = 1.25$
$L_{33} = \sqrt{6 - 0.5^2 - 1.25^2} = \sqrt{6 - 0.25 - 1.5625} = \sqrt{4.1875} \approx 2.046$

$$L = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 0 \\ 0.5 & 1.25 & 2.046 \end{pmatrix}$$

Why Cholesky is preferred: Half the flops of LU ($n^3/3$ vs $2n^3/3$), and it's numerically stable without pivoting. If the factorization fails (negative value under the square root at step $j$), the matrix is not PD — this is the PD test.

2.5 Schur Complement Criterion

For a block matrix $M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}$:

$$M \succ 0 \iff A \succ 0 \text{ and } C - B^T A^{-1} B \succ 0$$

The matrix $S = C - B^T A^{-1} B$ is the Schur complement of $A$ in $M$.

Why this matters: In bundle adjustment, the Hessian has block structure $\begin{pmatrix} H_{cc} & H_{cl} \\ H_{lc} & H_{ll} \end{pmatrix}$. The Schur complement eliminates landmarks, leaving a smaller camera-only system. Positive definiteness of $S$ guarantees the reduced system is well-posed.

2.5 Nearest Positive Definite Matrix

Given a symmetric matrix $A$ that is not PD (e.g., a noisy Hessian estimate), find the closest PD matrix.

Algorithm (Higham, 1988):

Compute eigendecomposition: $A = Q \Lambda Q^T$
Replace negative eigenvalues: $\tilde{\Lambda} = \text{diag}(\max(\lambda_i, \epsilon))$
Reconstruct: $\tilde{A} = Q \tilde{\Lambda} Q^T$

This gives the nearest PSD matrix in the Frobenius norm. Adding a small $\epsilon > 0$ to the zero eigenvalues makes it PD.

Optimization use: When Newton's method produces an indefinite Hessian, this (or adding $\lambda I$, or modified Cholesky) makes the search direction a descent direction.

3. SVD Masterclass

The Singular Value Decomposition is the most important matrix factorization. It applies to any matrix (no restrictions on shape or symmetry) and reveals rank, range, null space, condition number, and best low-rank approximation.

Notation: We use $\Sigma$ for both the SVD diagonal matrix and covariance matrices — the meaning is always clear from context. Some authors write $S$ or $D$ for the SVD diagonal to avoid ambiguity.

3.1 Full Derivation

For any $A \in \mathbb{R}^{m \times n}$ with rank $r$:

$$A = U \Sigma V^T$$

Construction: 1. Form $A^TA$ (symmetric PSD, $n \times n$). Its eigenvalues are $\sigma_1^2 \geq \cdots \geq \sigma_r^2 > 0 = \cdots = 0$. 2. Eigenvectors of $A^TA$ give $V = [v_1 | \cdots | v_n]$ (right singular vectors). 3. $\sigma_i = \sqrt{\lambda_i(A^TA)}$ are the singular values. 4. $u_i = Av_i / \sigma_i$ for $i = 1, \ldots, r$ give the left singular vectors. 5. Complete $U$ with an orthonormal basis for $N(A^T)$.

3.2 Geometric Interpretation

SVD decomposes any linear transformation into three steps:

$$A = \underbrace{U}_{\text{rotate output}} \cdot \underbrace{\Sigma}_{\text{stretch}} \cdot \underbrace{V^T}_{\text{rotate input}}$$

$V^T$ rotates the input to align with the principal axes
$\Sigma$ stretches along each axis by $\sigma_i$
$U$ rotates the result into the output space

The unit sphere $\|x\| = 1$ maps to an ellipsoid with semi-axes $\sigma_1, \ldots, \sigma_r$ oriented along $u_1, \ldots, u_r$.

3.3 Four Forms of SVD

Form	$U$	$\Sigma$	$V$	Use
Full	$m \times m$	$m \times n$	$n \times n$	Theoretical completeness
Thin/Economy	$m \times n$	$n \times n$	$n \times n$	Most computations
Compact	$m \times r$	$r \times r$	$n \times r$	Rank-$r$ representation
Truncated	$m \times k$	$k \times k$	$n \times k$	Best rank-$k$ approximation

NumPy: - np.linalg.svd(A) → full SVD - np.linalg.svd(A, full_matrices=False) → thin SVD

3.4 The Pseudoinverse (Moore-Penrose)

For $A = U\Sigma V^T$, the pseudoinverse is:

$$A^+ = V \Sigma^+ U^T$$

where $\Sigma^+$ has entries $1/\sigma_i$ for $\sigma_i > 0$ and $0$ otherwise.

Properties: - $AA^+A = A$ and $A^+AA^+ = A^+$ (generalized inverse conditions) - $x^+ = A^+ b$ is the minimum-norm least-squares solution to $Ax = b$ - If $A$ is invertible: $A^+ = A^{-1}$ - If $A$ has full column rank: $A^+ = (A^TA)^{-1}A^T$ - If $A$ has full row rank: $A^+ = A^T(AA^T)^{-1}$

NumPy: np.linalg.pinv(A) computes $A^+$ via SVD.

3.5 Eckart-Young-Mirsky Theorem

The best rank-$k$ approximation to $A$ (in both Frobenius and spectral norms) is:

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T$$

Error bounds: - Spectral norm: $\|A - A_k\|_2 = \sigma_{k+1}$ - Frobenius norm: $\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$

Applications: - Data compression: Keep top-$k$ singular values, discard noise in the small ones - Noise filtering: If true signal has rank $k$, truncated SVD removes noise - PCA: The top-$k$ left singular vectors of centered data $X$ are the principal components - Latent Semantic Analysis: Low-rank approximation of term-document matrices

3.6 Numerical Rank

In theory, rank = number of nonzero singular values. In practice, due to floating-point noise, we define:

$$\text{numerical rank} = \#\{i : \sigma_i > \tau\}$$

where $\tau$ is a threshold. Common choices: - $\tau = \epsilon_{\text{mach}} \cdot \sigma_1 \cdot \max(m, n)$ (NumPy default for np.linalg.matrix_rank) - $\tau = \epsilon_{\text{mach}} \cdot \sigma_1$ (more aggressive)

Robotics connection: In SLAM, the numerical rank of the Jacobian reveals how many state dimensions are actually observable given the current measurements.

4. Matrix Norms

Norms measure the "size" of a matrix. Different norms answer different questions.

4.1 Vector Norms (Review)

Norm	Definition	Geometric Meaning
$\\|x\\|_1 = \sum_i \|x_i\|$	Manhattan / taxicab distance	Diamond-shaped unit ball
$\\|x\\|_2 = \sqrt{\sum_i x_i^2}$	Euclidean distance	Circular unit ball
$\\|x\\|_\infty = \max_i \|x_i\|$	Chebyshev distance	Square unit ball
$\\|x\\|_p = \left(\sum_i \|x_i\|^p\right)^{1/p}$	General $p$-norm	Interpolates between above

4.2 Matrix Norms

Norm	Definition	Computation
Frobenius	$\\|A\\|_F = \sqrt{\sum_{ij} a_{ij}^2} = \sqrt{\text{tr}(A^TA)}$	$= \sqrt{\sum_i \sigma_i^2}$
Spectral (2-norm)	$\\|A\\|_2 = \max_{\\|x\\|=1} \\|Ax\\|_2$	$= \sigma_{\max}(A)$
Nuclear (trace)	$\\|A\\|_* = \sum_i \sigma_i$	Sum of singular values
Max norm	$\\|A\\|_{\max} = \max_{ij} \|a_{ij}\|$	Largest absolute entry

Relationships: $\|A\|_2 \leq \|A\|_F \leq \sqrt{r} \cdot \|A\|_2$ where $r = \text{rank}(A)$.

4.3 Induced (Operator) Norms

The matrix $p$-norm is the maximum stretching factor:

$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$$

$p$	Formula
$p = 1$	$\\|A\\|_1 = \max_j \sum_i \|a_{ij}\|$ (max column sum)
$p = 2$	$\\|A\\|_2 = \sigma_{\max}(A)$ (largest singular value)
$p = \infty$	$\\|A\\|_\infty = \max_i \sum_j \|a_{ij}\|$ (max row sum)

4.4 Submultiplicativity

For any consistent matrix norm: $\|AB\| \leq \|A\| \cdot \|B\|$.

Consequence: $\|A^k\| \leq \|A\|^k$. This bounds the error growth in iterative methods.

4.5 Norms in Optimization

Frobenius norm → measures overall "size" of a matrix; used in regularization ($\min \|X\|_F$)
Spectral norm → measures worst-case amplification; determines condition number
Nuclear norm → convex relaxation of rank; used in matrix completion / low-rank recovery
$\ell_1$ on singular values = nuclear norm → promotes low rank (like $\ell_1$ promotes sparsity)

Motivating example (nuclear norm): In the Netflix Prize, the user-movie ratings matrix is mostly missing entries. The nuclear norm $\|X\|_*$ serves as a convex proxy for rank: minimizing $\|X\|_*$ subject to matching known ratings recovers a low-rank completion — meaning users' preferences lie in a low-dimensional space of "taste profiles."

5. Ill-Conditioning — Diagnosis and Remedies

5.1 Condition Number Revisited

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}$$

Rules of thumb:

$\kappa(A)$	Digits Lost	Status	Action
$1$	$0$	Perfect	None needed
$10^3$	$\sim 3$	Moderate	Monitor
$10^6$	$\sim 6$	Ill-conditioned	Precondition or regularize
$10^{12}$	$\sim 12$	Severely ill-conditioned	Regularize; direct solve unreliable
$> 10^{16}$	All	Effectively singular	Rank-deficient; use pseudoinverse

Forward error bound: $\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}$

If you store numbers with $d$ digits of precision, and $\kappa(A) \approx 10^k$, then the solution has at most $d - k$ reliable digits.

5.2 Sources of Ill-Conditioning in Robotics

Source	Why It Happens	Typical $\kappa$
Scale mismatch	Translation (meters) vs rotation (radians) in same state	$10^2 - 10^4$
Near-singularity	Near-collinear landmarks, degenerate configurations	$10^6 - 10^{12}$
Gauge freedom	Absolute position/orientation unobservable in SLAM	$\infty$ (rank-deficient)
Poor excitation	Parameters that barely affect measurements	$10^4 - 10^8$
Large dynamic range	Nearby + distant features in same BA problem	$10^3 - 10^6$
Normal equations	Forming $J^TJ$ squares the condition number	$\kappa(J)^2$

5.3 Catastrophic Cancellation — A Worked Example

Consider solving:

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 + 10^{-15} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 + 10^{-15} \end{pmatrix}$$

True solution: $x = [1, 1]^T$. Condition number $\kappa \approx 4 \times 10^{15}$.

In double precision ($\epsilon_{\text{mach}} \approx 10^{-16}$): the solver produces wildly wrong results because we lose all significant digits. Tiny changes to the RHS produce completely different solutions.

5.4 Regularization Techniques

When the problem is ill-conditioned or rank-deficient, regularization adds prior information to stabilize the solution.

Tikhonov Regularization (Ridge Regression)

$$\min_x \|Ax - b\|^2 + \lambda \|x\|^2$$

Solution: $x_\lambda = (A^TA + \lambda I)^{-1} A^T b$

Effect: Condition number becomes $\kappa(A^TA + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}$

For large $\lambda$: $\kappa \to 1$ (well-conditioned but biased). For small $\lambda$: $\kappa \to \kappa(A^TA)$ (unbiased but ill-conditioned).

Robotics analogy: Levenberg-Marquardt adds $\lambda I$ to $J^TJ$ — this IS Tikhonov regularization of the Gauss-Newton step.

Truncated SVD

$$x_k = \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i$$

Keep only the $k$ largest singular values/vectors. Small $\sigma_i$ amplify noise, so discarding them improves stability at the cost of a biased solution.

When to use which: - Tikhonov when all parameters are meaningful but noisy → smooth damping - Truncated SVD when some parameters are genuinely unobservable → hard cutoff - Both combined (damped truncated SVD): discard the worst, damp the rest

L-Curve Method for Choosing $\lambda$

Plot $\|\text{residual}\|$ vs $\|\text{solution}\|$ as $\lambda$ varies. The optimal $\lambda$ is at the "corner" of the L-shaped curve — balancing data fit against solution regularity.

5.5 Preconditioning Strategies

Transform $Hx = g$ into $\tilde{H}\tilde{x} = \tilde{g}$ with $\kappa(\tilde{H}) \ll \kappa(H)$.

Preconditioner	Construction	Cost	Best For
Jacobi	$M = \text{diag}(H)$	$O(n)$	Scale-mismatched problems
Block Jacobi	$M = \text{blockdiag}(H)$	$O(n \cdot b^3)$	Block-structured Hessians
SSOR	Symmetric SOR based on $H = D + L + L^T$	$O(\text{nnz})$	General sparse
Incomplete Cholesky	Sparse $\tilde{L}\tilde{L}^T \approx H$	$O(\text{nnz})$	Sparse PD systems
AMG	Algebraic multigrid hierarchy	$O(n)$ apply	Very large sparse systems

6. Special Matrices in Robotics & Optimization

6.1 Rotation Matrices — $SO(2)$ and $SO(3)$

$SO(2)$: 2D rotation by angle $\theta$: $$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

Properties: $R^TR = I$, $\det(R) = +1$, eigenvalues $= e^{\pm i\theta}$, $\kappa(R) = 1$.

$SO(3)$: 3D rotation parameterized by axis-angle $\omega = \theta \hat{n}$: $$R = I + \sin\theta [\hat{n}]_\times + (1 - \cos\theta)[\hat{n}]_\times^2$$ (Rodrigues' formula)

where $[\hat{n}]_\times$ is the $3 \times 3$ skew-symmetric cross-product matrix.

Key fact: Rotation matrices are always perfectly conditioned ($\kappa = 1$) and orthogonal. They never cause numerical trouble by themselves — trouble arises when rotation parameters are mixed with translation parameters of very different scale.

6.2 Covariance Matrices

$$\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T]$$

Properties: Always symmetric PSD. Diagonal entries = variances. Off-diagonal = covariances.

Eigenvalue interpretation: - Eigenvectors = principal directions of uncertainty - Eigenvalues = variance along each principal direction - $\sqrt{\lambda_i}$ = standard deviation along $i$-th principal direction - The uncertainty ellipsoid is $\{x : (x-\mu)^T \Sigma^{-1} (x-\mu) \leq c^2\}$

6.3 Information Matrices (Precision Matrices)

$$\Lambda = \Sigma^{-1}$$

Properties: PD (if $\Sigma$ is PD). In SLAM, $\Lambda$ is sparse because only spatially adjacent variables share measurements, while $\Sigma$ is typically dense.

Why SLAM uses information form: - Sparse $\Lambda$ → efficient storage and computation - Adding a new measurement = adding to $\Lambda$ (additive update) - Marginalizing a variable = Schur complement (fill-in, but manageable)

6.4 Jacobian Matrices

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

Properties: Generally rectangular ($m \times n$, often $m > n$). Not necessarily symmetric.

What the SVD of $J$ tells you: - Rank = number of independently constrained state dimensions - Null space ($V$ columns for $\sigma_i = 0$) = unobservable directions - Condition number $\kappa(J)$ = how sensitive the solution is to measurement noise - Left null space ($U$ columns for $\sigma_i = 0$) = redundant measurements

6.5 Hessian Matrices

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Properties: Always symmetric (by Schwarz's theorem if $f$ is $C^2$). At a local minimum: $H \succeq 0$ (PSD). At a strict local minimum of a strongly convex function: $H \succ 0$ (PD).

Gauss-Newton approximation: For least-squares $f(x) = \frac{1}{2}\|r(x)\|^2$: $$H \approx J^T J$$ (drops second-order residual terms). This approximation is PSD by construction, even if the true Hessian is indefinite.

6.6 Graph Laplacian

For a graph with $n$ nodes, weighted adjacency $W$, and degree matrix $D = \text{diag}(\sum_j w_{ij})$:

$$L = D - W$$

Properties: - Symmetric, PSD - Smallest eigenvalue is always $0$ (eigenvector = $\mathbf{1}$, the constant vector) - Number of zero eigenvalues = number of connected components - Second-smallest eigenvalue (Fiedler value) = algebraic connectivity

Optimization use: Pose-graph SLAM can be viewed as minimizing a quadratic form on a graph Laplacian. Spectral methods use the Laplacian eigenvectors for initialization.

7. Matrix Calculus Quick Reference

7.1 Common Derivatives

Expression	Derivative w.r.t. $x$
$a^Tx$	$a$
$x^TAx$	$(A + A^T)x$ (or $2Ax$ if $A$ symmetric)
$\\|Ax - b\\|^2$	$2A^T(Ax - b)$
$\text{tr}(AX)$	$A^T$ (w.r.t. $X$)
$\log\det(X)$	$X^{-T}$ (w.r.t. $X$, for PD $X$)
$x^TA^{-1}x$	$-A^{-T}xx^TA^{-T}$ (w.r.t. $A$)

7.2 The Chain Rule for Matrix-Valued Functions

If $f : \mathbb{R}^n \to \mathbb{R}$ is composed as $f = g \circ h$ where $h : \mathbb{R}^n \to \mathbb{R}^m$:

$$\nabla_x f = J_h^T \nabla_y g$$

This is the foundation of backpropagation and the adjoint method in automatic differentiation.

8. Summary — The Matrix Criteria Cheat Sheet

When You Encounter a Matrix, Ask:

Is it symmetric? → Eigendecomposition exists; definiteness is meaningful
Is it PD? → Try Cholesky. If yes: unique minimum, use Cholesky solver
What's the condition number? → Predicts convergence speed and accuracy
What's the rank? → SVD reveals observable vs unobservable directions
Is it sparse? → Use sparse decompositions, iterative methods
Does it have block structure? → Exploit with Schur complement

Algorithm Selection Flowchart

Problem: Solve Hx = g (where H is your Hessian)

Is H symmetric?
├── No → Use LU decomposition (O(2n³/3))
└── Yes
    ├── Is H positive definite?
    │   ├── Yes → Use Cholesky (O(n³/3)) ← FASTEST
    │   └── No / Unknown
    │       ├── Is H PSD? → Use LDLᵀ (handles zeros)
    │       └── Is H indefinite? → Modified Cholesky or add λI
    └── Is H sparse?
        ├── Yes
        │   ├── PD? → Sparse Cholesky with ordering (CHOLMOD)
        │   └── General? → Sparse LU (UMFPACK) or CG with preconditioner
        └── No → Dense decomposition as above

Exercises

See Exercise Set 7: Matrix Mastery for 30 graded problems covering all topics in this guide — from hand computation through Python implementation to real-world robotics scenarios.