Day 24: Stochastic and Mini-Batch Methods
Phase II — Unconstrained Optimization | Week 4 | 2.5 hours ⚡ ENRICHMENT
SGD, Adam, and friends — ML's workhorses, briefly visiting robotics.
Navigation
OKS Relevance
While OKS robots primarily use deterministic optimization (Ceres, OSQP), stochastic methods appear in two key areas: (1) online parameter estimation where sensor data arrives as a stream and parameters must be updated incrementally, and (2) learning-based components like terrain classifiers or neural SLAM modules. Understanding SGD/Adam also builds intuition for adaptive step size selection.
Note: SGD/Adam are primarily ML tools. Robotics optimization rarely uses stochastic methods. Treat this as enrichment — skim if time-constrained, deep-dive if working on learning-based robotics.
Theory (45 min)
24.1 Stochastic Gradient Descent (SGD)
Given $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$, replace the full gradient with a stochastic estimate:
$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x)$$
where $i_k$ is randomly sampled.
Key property: $\mathbb{E}[\nabla f_{i_k}(x)] = \nabla f(x)$ — unbiased estimator.
Convergence rate: $O(1/\sqrt{k})$ for convex, $O(1/k)$ for strongly convex with decaying $\alpha_k = O(1/k)$.
24.2 Mini-Batch Gradient
Use a batch $B_k$ of size $b$:
$$x_{k+1} = x_k - \alpha_k \cdot \frac{1}{b}\sum_{i \in B_k} \nabla f_i(x_k)$$
Variance reduction: $\text{Var}[\text{estimate}] = \frac{\sigma^2}{b}$.
Trade-off: larger $b$ → lower variance but more compute per step.
24.3 Adaptive Methods
AdaGrad — adapt per-parameter:
$$x_{k+1,j} = x_{k,j} - \frac{\alpha}{\sqrt{G_{k,j} + \varepsilon}} g_{k,j}$$
where $G_{k,j} = \sum_{t=1}^k g_{t,j}^2$.
RMSProp — exponential moving average instead of full history:
$$v_{k,j} = \beta v_{k-1,j} + (1-\beta) g_{k,j}^2$$
$$x_{k+1,j} = x_{k,j} - \frac{\alpha}{\sqrt{v_{k,j} + \varepsilon}} g_{k,j}$$
Adam — RMSProp + momentum with bias correction:
$$m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k \quad \text{(first moment)}$$
$$v_k = \beta_2 v_{k-1} + (1-\beta_2) g_k^2 \quad \text{(second moment)}$$
$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}$$
$$x_{k+1} = x_k - \frac{\alpha}{\sqrt{\hat{v}_k} + \varepsilon} \hat{m}_k$$
Default: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$, $\alpha = 10^{-3}$.
24.4 Learning Rate Schedules
| Schedule |
Formula |
When |
| Constant |
$\alpha_k = \alpha_0$ |
Simple, but won't converge to exact minimum |
| Step decay |
$\alpha_k = \alpha_0 \cdot \gamma^{\lfloor k/s \rfloor}$ |
Every $s$ steps, decay by $\gamma$ |
| Inverse |
$\alpha_k = \alpha_0 / (1 + dk)$ |
Theoretically optimal for convex |
| Cosine annealing |
$\alpha_k = \frac{\alpha_0}{2}(1 + \cos(\pi k / T))$ |
Popular in deep learning |
| Warm-up + decay |
Linearly increase, then decay |
Transformer training |
24.5 When Robotics Uses Stochastic Methods
- Online calibration: streaming sensor data → stochastic parameter update
- Neural network training: perception, learned cost functions
- Reinforcement learning: policy gradient methods are inherently stochastic
- Real-time estimation: particle filters use stochastic sampling
Implementation (60 min)
Exercise 1: SGD for Sum-of-Functions
# See: code/week04/stochastic_methods.py
Implement SGD for $f(x) = \frac{1}{N}\sum_i f_i(x)$ where each $f_i$ is a single data point's contribution.
Exercise 2: Mini-Batch GD
Implement mini-batch GD with configurable batch size. Show variance reduction as batch size increases.
Exercise 3: GD vs SGD vs Mini-Batch on Linear Regression
Generate $N = 10000$ samples, compare full GD, SGD, and mini-batch ($b = 32, 128, 512$) on linear regression.
Exercise 4: Learning Rate Schedules
Implement constant, $1/k$, and cosine annealing schedules. Show their effect on convergence.
Exercise 5: Adam from Scratch
Implement Adam optimizer with bias correction. Compare to SGD on a nonconvex toy problem.
Practice (45 min)
- When is SGD better than full-batch GD? (Compute budget: $N = 10^6$ and you can only do 100 passes.)
- Derive: why does mini-batch variance scale as $\sigma^2/b$?
- What is the optimal batch size for a given compute budget?
- Why does Adam often converge faster than SGD in early iterations but may not find as good a minimum?
- In robotics online estimation: if 100 sensor readings arrive per second, is it better to batch them or process one at a time?
- Why does AdaGrad's learning rate go to zero? How does RMSProp fix this?
- Compare Adam to L-BFGS: when would you prefer each?
- What is the "noise ball" — why does SGD with constant step size orbit the minimum?
Expert Challenges
🎯 Challenge 1: Variance reduction with SVRG
**Problem:** Implement Stochastic Variance-Reduced Gradient (SVRG): periodically compute the full gradient, then use it to reduce variance of stochastic updates. Show that SVRG converges linearly (like GD) while using stochastic updates (like SGD).
**Solution:** SVRG outer loop: compute $\tilde{\mu} = \nabla f(\tilde{x})$ at a reference point. Inner loop: $g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \tilde{\mu}$. The correction term $-\nabla f_{i_k}(\tilde{x}) + \tilde{\mu}$ ensures $\mathbb{E}[g_k] = \nabla f(x_k)$ with variance → 0 as $x_k \to \tilde{x}$. Converges linearly for strongly convex $f$.
🎯 Challenge 2: Adam convergence failure
**Problem:** Construct a simple 1D problem where Adam with default parameters converges to the *wrong* point (non-convergence of Adam, Reddi et al., 2018). Then implement AMSGrad (the fix) and show it converges correctly.
**Solution:** The classic counter-example alternates gradients: $g_k = C$ for $k \mod 3 \neq 1$ and $g_k = -\sqrt{C}$ otherwise, with $C > 1$. The running average in Adam can make the effective update direction wrong. AMSGrad fixes this by taking $\hat{v}_k = \max(\hat{v}_{k-1}, v_k)$ — it never decreases the denominator, ensuring convergence.
🎯 Challenge 3: Stochastic optimization for robot calibration
**Problem:** Simulate an online wheel calibration scenario: encoder ticks and true distances arrive in a stream. Implement an SGD-based online estimator for wheel radius. Compare to batch least squares (computing periodically with all data). Show that SGD with averaging converges to the same answer.
**Solution:** Model: $d_i = \pi r \cdot \text{ticks}_i + \varepsilon_i$. Loss: $\frac{1}{2}(d_i - \pi r \cdot \text{ticks}_i)^2$. SGD update: $r \leftarrow r - \alpha_k \cdot \text{ticks}_i \cdot \pi \cdot (d_i - \pi r \cdot \text{ticks}_i)$. With Polyak averaging ($\bar{r}_k = \frac{1}{k}\sum r_j$), this converges to the batch LS solution. Practical benefit: handles non-stationary wheel wear by weighting recent data more.
Self-Assessment Checklist
- [ ] I can implement SGD, mini-batch GD, and Adam from scratch
- [ ] I understand the variance-compute tradeoff in batch size selection
- [ ] I know when stochastic methods are appropriate in robotics (and when they're not)
- [ ] I can implement and compare different learning rate schedules
Key Takeaways
- SGD trades accuracy per step for speed per step — useful when $N$ is huge.
- Adam adapts per-parameter learning rates — the default choice for neural networks.
- Robotics rarely needs SGD — problems are usually small enough for batch methods.
- Online estimation is the main robotics use case for stochastic optimization ideas.