Phase II: Attention, Transformers & Scaling | Week 4 | 2.5 hours

"The same architecture, more data, more compute, keeps getting better along a predictable power law. No architecture changes needed. It just keeps going. What does this tell us about the nature of learning?"
This is a reflection day. No new implementation. No new architecture. Instead, you consolidate what you've learned over the last two weeks, from attention mechanisms through scaling laws, and connect its implications back to your ultimate goal: robot intelligence.
Format: Write in a notebook (physical or digital). Spend real time thinking, not just reading.
Language models scale because:

- Text data is essentially free (the internet)
- Loss function is clear (next-token prediction)
- Compute is available (buy more GPUs)

Robot learning faces a different world:

- Data costs ~$100/hour to collect (physical hardware, human operators)
- The "right" loss function is unclear (what's the next-action equivalent of next-token prediction?)
- Sim-to-real gap makes simulated data less reliable
Write 500+ words answering: "If scaling laws hold for robotic action prediction, what would a Chinchilla-optimal VLA look like? If they don't hold, why not, and what does that mean for the field?"
Consider these dimensions:
| Dimension | Language | Robotics |
|---|---|---|
| Data cost | ~$0 per token | ~$100 per hour of robot data |
| Data diversity | Billions of authors, topics | Tens of labs, hundreds of tasks |
| Token semantics | Discrete, well-defined | Continuous actions, high-dimensional |
| Loss function | Cross-entropy on next token | MSE on actions? Diffusion loss? |
| Evaluation | Perplexity, benchmarks | Real-world task success rate |
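As a starting point for the essay, the data-cost row can be turned into rough arithmetic. The sketch below reuses two figures from these notes (about $100 per hour of robot data and the Chinchilla rule of thumb of 20 tokens per parameter). Everything else is an assumption made only for illustration: a 10 Hz control loop, 7 action dimensions, one token per dimension per step, and the function name `robot_data_budget`. Real VLAs also consume image and language tokens, which this ignores.

```python
# Back-of-envelope data budget for a hypothetical Chinchilla-style VLA.
TOKENS_PER_PARAM = 20    # rule of thumb quoted in these notes
COST_PER_HOUR = 100.0    # USD per hour of robot data, from the table above
CONTROL_HZ = 10          # ASSUMPTION: control steps per second
ACTION_DIMS = 7          # ASSUMPTION: e.g. 6-DoF arm + gripper, 1 token each


def robot_data_budget(n_params):
    """Return tokens, robot-hours, and USD needed for a model of n_params."""
    tokens_needed = TOKENS_PER_PARAM * n_params
    tokens_per_hour = CONTROL_HZ * ACTION_DIMS * 3600
    hours = tokens_needed / tokens_per_hour
    return {
        "tokens": tokens_needed,
        "hours": hours,
        "cost_usd": hours * COST_PER_HOUR,
    }


for n in (1e8, 1e9, 1e10):
    b = robot_data_budget(n)
    print(f"{n:.0e} params: {b['tokens']:.1e} tokens, "
          f"{b['hours']:,.0f} hours, ${b['cost_usd']:,.0f}")
```

Under these assumptions a 100M-parameter model already needs several thousand robot-hours. Changing the tokens-per-hour assumption shifts the numbers a lot, which is itself worth discussing in your answer.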
Go back to your Day 5 notes on information theory and compression. Re-read them with fresh eyes.
Day 5 established:

- Cross-entropy loss = bits needed to encode data under the model
- Better models = better compressors
- Rate-distortion theory: there's a fundamental tradeoff between compression and fidelity

Day 25 showed:

- Scaling laws: $L(C) \propto C^{-\alpha}$
- More compute → lower loss → fewer bits → better compression
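The first two Day 5 points can be checked on a toy example. This is a minimal sketch, not anything from the Day 5 notes themselves: the eight-character corpus, the two hand-written "models", and the helper `cross_entropy_bits` are all made up for illustration.

```python
import math


def cross_entropy_bits(tokens, model_probs):
    """Average bits per token needed to encode `tokens` under `model_probs`."""
    return -sum(math.log2(model_probs[t]) for t in tokens) / len(tokens)


# Toy corpus: 'a' makes up 6/8 of the data, 'b' and 'c' 1/8 each.
data = list("aaaaaabc")

# A model that knows nothing vs. one that matches the data frequencies.
uniform_model = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
better_model = {"a": 0.75, "b": 0.125, "c": 0.125}

bits_uniform = cross_entropy_bits(data, uniform_model)  # log2(3), about 1.585
bits_better = cross_entropy_bits(data, better_model)    # about 1.061

print(f"uniform model: {bits_uniform:.3f} bits/token")
print(f"better model:  {bits_better:.3f} bits/token")
```

The better model reaches the entropy of the empirical distribution, so no model can encode this particular corpus in fewer bits per token on average. That lower bound is the small-scale version of the limit the writing prompt asks about.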
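The synthesis: Power-law scaling of loss = power-law improvement in compression efficiency. Each order of magnitude of compute multiplies the loss by the same factor, $10^{-\alpha}$, so it buys a fixed percentage reduction in bits per token. The model isn't just memorizing; it's discovering increasingly deep structure. Treating the compression ratio as inversely proportional to the loss gives:
$$\text{Compression ratio} \propto C^{\alpha}$$
This holds only while the loss is far from any irreducible floor (the entropy of the data itself); near that floor the ratio levels off.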
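To see what "a fixed percentage per order of magnitude" means numerically, here is a small sketch that fits $\alpha$ by linear regression in log-log space. The data are synthetic: the exponent 0.05, the prefactor 10, and the compute range are illustrative assumptions, not measured values, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log10(loss) = log10(a) - alpha * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(l) for l in loss]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return -slope, 10 ** intercept  # (alpha, a)


# Synthetic, noise-free data following L(C) = a * C^(-alpha).
TRUE_ALPHA, TRUE_A = 0.05, 10.0  # ASSUMPTION: illustrative values only
compute = [10.0 ** k for k in range(18, 25)]
loss = [TRUE_A * c ** -TRUE_ALPHA for c in compute]

alpha, a = fit_power_law(compute, loss)
per_decade = 10 ** -alpha  # factor applied to the loss per 10x compute

print(f"alpha = {alpha:.3f}")
print(f"each 10x compute multiplies loss by {per_decade:.3f}")
```

With this assumed exponent, every tenfold increase in compute trims the loss by roughly 11 percent. Small exponents are why scaling feels both reliable and expensive.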
Write 300+ words connecting: "How does the information-theoretic view of learning (Day 5) illuminate what scaling laws (Day 25) are actually measuring?"
Key threads to weave:

- Shannon's source coding theorem and the fundamental limit of compression
- Are scaling laws approaching Shannon's limit? How would we know?
- If a VLA compresses robot experience into a compact model, what "structure" is it discovering?
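One possible angle on the "how would we know?" thread: if the loss has an irreducible floor, a log-log plot stops being a straight line and its local slope drifts toward zero. The sketch below compares a pure power law with one that has a floor. The form $L(C) = L_\infty + a\,C^{-\alpha}$ and all three constants are assumptions chosen for illustration, not fitted values.

```python
import math


def local_slopes(compute, loss):
    """Log-log slope between each pair of consecutive (compute, loss) points."""
    pts = list(zip(compute, loss))
    return [
        (math.log10(l2) - math.log10(l1)) / (math.log10(c2) - math.log10(c1))
        for (c1, l1), (c2, l2) in zip(pts, pts[1:])
    ]


# ASSUMPTION: hypothetical constants, picked only to make the effect visible.
L_INF, A, ALPHA = 1.7, 10.0, 0.05
compute = [10.0 ** k for k in range(18, 31, 2)]

pure = [A * c ** -ALPHA for c in compute]             # no floor
floored = [L_INF + A * c ** -ALPHA for c in compute]  # entropy-like floor

pure_slopes = local_slopes(compute, pure)
floored_slopes = local_slopes(compute, floored)

print("pure:   ", [f"{s:.3f}" for s in pure_slopes])
print("floored:", [f"{s:.3f}" for s in floored_slopes])
```

The pure power law keeps a constant slope of $-\alpha$, while the floored curve's slope shrinks in magnitude as compute grows. In practice noise makes this curvature hard to detect, which is part of what makes the question interesting.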
Read (or re-read) Rich Sutton's essay The Bitter Lesson (2019).
Core claim: General methods that leverage computation are ultimately the most effective, and by a large margin. Approaches built on hand-crafted human knowledge tend to help in the short term but lose in the long run to simpler methods that scale with compute.
Historical evidence:

- Chess: Deep Blue beat Kasparov with search and hardware, not clever chess knowledge
- Go: AlphaGo beat Lee Sedol with neural nets, search, and compute, not Go heuristics
- Speech: End-to-end neural nets replaced hand-crafted phoneme models
- Vision: ConvNets replaced hand-crafted features (SIFT, HOG)
- NLP: Transformers + scale replaced parse trees, grammars, ontologies

Classic robotics is full of hand-crafted approaches:

- PID controllers with manually tuned gains
- Hand-designed motion planners (RRT, A*)
- Carefully engineered state machines
- Physics-based models with hand-measured parameters
The bitter lesson predicts: end-to-end learned controllers that scale with data and compute will eventually beat all of these.
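To make "hand-crafted" concrete before you write, here is a minimal sketch of the first item in the list above: a textbook discrete PID controller. The toy plant (a position that simply integrates the commanded velocity), the gains, and the names `PID` and `simulate` are all assumptions for illustration. The gains were picked by hand for this one plant, which is exactly the kind of human knowledge the bitter lesson says will not scale.

```python
class PID:
    """Textbook discrete PID controller with hand-picked gains."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error * self.dt
        # No derivative term on the first call, to avoid a spurious kick.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target=1.0, steps=2000, dt=0.01):
    """Drive a toy plant (position integrates commanded velocity) to target."""
    # ASSUMPTION: gains tuned by hand for this toy plant only.
    pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=dt)
    x = 0.0
    for _ in range(steps):
        x += pid.step(target - x) * dt
    return x


final_position = simulate()
print(f"final position: {final_position:.4f}")
```

Every number in `PID(kp=2.0, ki=0.5, kd=0.1, ...)` would need retuning for a different plant. A learned controller replaces that tuning with data and compute, at the cost of the transparency these few lines offer.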
Write 300+ words: "Is the bitter lesson already playing out in robotics? Where does hand-crafted still win, and for how long?"
Consider:

- Navigation: ROS nav stack vs learned navigation
- Manipulation: trajectory optimization vs diffusion policy
- Where does interpretability/safety override the bitter lesson?
- Your own OKS system: which components are hand-crafted? Which could be learned?
Revisit the diagram you've been building since Day 1. Add:
Updated Mental Map (Day 26):
Information Theory (Day 5)
│
├── Compression = Understanding
│ │
│ └── Scaling Laws (Day 25)
│ │
│ ├── L(C) ∝ C^(-α)
│ ├── Chinchilla: 20 tokens/param
│ └── More compute = better compression
│
Transformer (Day 14)
│
├── Attention = Dynamic routing
├── Same arch scales to any size
└── The Bitter Lesson: scale > cleverness
→ Implication for VLAs:
Can we find a single architecture that scales
from toy tasks to real-world robot intelligence?
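The "Chinchilla: 20 tokens/param" entry in the map can be turned into a quick sizing rule. This sketch combines it with the common approximation that training costs about 6 FLOPs per parameter per token, $C \approx 6ND$. That approximation and the helper name `compute_optimal_split` are assumptions not stated elsewhere in these notes. Substituting $D = 20N$ gives $N = \sqrt{C/120}$.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb from the mental map
FLOPS_PER_PARAM_TOKEN = 6   # ASSUMPTION: standard approximation C = 6 * N * D


def compute_optimal_split(flops):
    """Split a training FLOP budget into (parameters, tokens).

    Solves C = 6 * N * D with D = 20 * N, i.e. N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Sanity check against Chinchilla itself: 70B params on 1.4T tokens
# corresponds to roughly 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs.
n, d = compute_optimal_split(5.88e23)
print(f"params: {n:.2e}, tokens: {d:.2e}")
```

Because both parameters and tokens grow with the square root of compute, a 100x larger budget implies only about a 10x larger model, plus 10x more data. For robotics, that data half of the bargain is the expensive part.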
This is the midpoint of Phase II. You now understand why transformers work (attention), how they work (architecture), and what happens when you scale them (power laws). The remaining days in this phase will cover generation strategies (Day 27), alternative architectures (Day 28), and a capstone project (Days 29-30) that ties it all together.