Day 73: PPO & RLHF Connection

Phase VI — Robot Learning: RL, Diffusion & Data | Week 11 | 2.5 hours "PPO is the workhorse of modern RL — simple enough to implement, stable enough to actually work." — John Schulman


Theory (45 min)

73.1 Why Not Just Use Policy Gradients?

Vanilla policy gradient (REINFORCE, A2C) has a critical flaw: step size sensitivity.

Too small step  → painfully slow learning
Too large step  → policy collapses, catastrophic forgetting
Just right      → unknown in advance, changes during training

Trust Region idea: constrain how much the policy can change per update.
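
For reference, the trust-region idea is commonly written as the constrained problem used by TRPO. Here $r_t(\theta)$ is the probability ratio between the new and old policy, $\hat{A}_t$ is an advantage estimate, and $\delta$ is a small KL budget; the symbol $\delta$ is an assumption of this sketch and is not used elsewhere in these notes.

$$\max_\theta \; \mathbb{E}_t\left[ r_t(\theta) \hat{A}_t \right] \quad \text{s.t.} \quad \mathbb{E}_t\left[ D_\text{KL}\left( \pi_{\theta_\text{old}}(\cdot | s_t) \,\|\, \pi_\theta(\cdot | s_t) \right) \right] \le \delta$$

Solving this constrained problem exactly is expensive, which is why a cheaper approximation is attractive.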

73.2 PPO: Clipped Surrogate Objective

PPO approximates a trust region using a simple clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where:
- $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)}$: probability ratio
- $\epsilon = 0.2$: clipping threshold
- $\hat{A}_t$: GAE advantage from Day 72

           r(θ)·Â
             │
    ─────────┼───────── (1+ε)·Â
             │ ╱
             │╱  ← clipped: prevents too-large updates
    ─────────┼───────── (1-ε)·Â
            ╱│
           ╱ │
    ──────╱──┼─────────
         ╱   │
     r(θ): 0.8  1.0  1.2

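Why it works: if the advantage is positive and the ratio exceeds $1+\epsilon$, the min selects the clipped term, which is constant in $\theta$, so the gradient is zero. Symmetrically, if the advantage is negative and the ratio falls below $1-\epsilon$, the gradient is also zero. Once the ratio has left $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors, that sample stops contributing gradient, so the policy can't move too far in one update.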
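
The flat regions can be checked numerically. The following is a minimal pure-Python sketch, not part of the implementation later in this lesson; the helper names `clipped_surrogate` and `slope` are hypothetical, and the slope is a finite-difference estimate rather than an autograd gradient.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)


# (ratio, advantage) pairs on both sides of the clipping window [0.8, 1.2].
cases = [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]
for ratio, advantage in cases:
    print(f"r={ratio}, A={advantage:+.0f}: "
          f"objective={clipped_surrogate(ratio, advantage):+.2f}, "
          f"slope={slope(ratio, advantage):+.2f}")
```

The slope is zero for (r=1.5, A>0) and (r=0.5, A<0), the two cases where the policy has already moved far in the favorable direction. In the other two cases the objective still follows $r \cdot \hat{A}$, so the update can pull the ratio back toward 1.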

73.3 PPO Training Loop

repeat:
  1. Collect N timesteps with current policy π_old
  2. Compute GAE advantages Â_t
  3. For K epochs over the collected data:
     a. Compute r(θ) = π_θ(a|s) / π_old(a|s)
     b. Compute clipped objective L^CLIP
     c. Update θ via gradient ascent
  4. Set π_old ← π_θ

Key hyperparameters:

Parameter	Typical Value	Effect
$\epsilon$ (clip)	0.2	Trust region width
K (epochs)	3–10	Reuse of collected data
N (batch)	2048–8192	Data per update
$\gamma$	0.99	Discount factor
$\lambda$ (GAE)	0.95	Advantage bias-variance
lr	3e-4	Learning rate
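
To make the table concrete, the values can be bundled into a small config object. This is a sketch only: the `PPOConfig` class and its method names are hypothetical, and `n_epochs=10` is one pick from the 3–10 range in the table. The `effective_horizon` helper uses the common rule of thumb that rewards matter over roughly $1/(1-\gamma)$ steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Typical PPO defaults from the table above (hypothetical container)."""
    clip_eps: float = 0.2      # trust region width
    n_epochs: int = 10         # K: passes over each rollout
    n_steps: int = 2048        # N: timesteps collected per update
    gamma: float = 0.99        # discount factor
    gae_lambda: float = 0.95   # GAE bias-variance trade-off
    lr: float = 3e-4           # learning rate

    def num_updates(self, total_steps: int) -> int:
        """Number of collect-then-optimize iterations that fit in the step budget."""
        return total_steps // self.n_steps

    def effective_horizon(self) -> float:
        """Rule-of-thumb horizon 1 / (1 - gamma) over which rewards matter."""
        return 1.0 / (1.0 - self.gamma)


cfg = PPOConfig()
print(cfg.num_updates(100_000))         # rollouts available in a 100k-step budget
print(round(cfg.effective_horizon()))   # approximate reward horizon in steps
```

With a 100,000-step budget and N = 2048, only a few dozen policy updates happen, which is why reusing each rollout for K epochs matters.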

73.4 The RLHF Connection

Recall Day 33: RLHF trains LLMs using PPO where:

RL Component	RLHF Mapping
State $s$	Prompt + partial generation
Action $a$	Next token
Policy $\pi_\theta$	Language model
Reward $R$	Reward model score
$\pi_\text{ref}$	Frozen reference LM, usually the supervised fine-tuned or pre-trained checkpoint (KL penalty)

$$\text{RLHF objective} = \mathbb{E}\left[R(y) - \beta \cdot D_\text{KL}(\pi_\theta \| \pi_\text{ref})\right]$$

Same algorithm, different domain: PPO clips policy updates to prevent catastrophic forgetting — whether the policy generates text or robot actions.
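
One common way to feed the RLHF objective to PPO is to fold the KL term into per-token rewards: each token is penalized by $\beta$ times the log-probability gap between the policy and the reference, and the reward model score is added on the final token. The sketch below assumes that scheme; the function name `kl_shaped_rewards`, the example numbers, and `beta=0.1` are all hypothetical.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_model_score, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus the score on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob sequences of equal length")
    rewards = [-beta * (lp_policy - lp_ref)
               for lp_policy, lp_ref in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += reward_model_score  # sequence-level score arrives at the end
    return rewards


# Made-up log-probs for a 3-token response under the policy and the reference.
rewards = kl_shaped_rewards(
    policy_logprobs=[-1.0, -0.5, -2.0],
    ref_logprobs=[-1.5, -0.5, -1.0],
    reward_model_score=2.0,
)
print(rewards, sum(rewards))
```

Summing the per-token rewards gives the reward model score minus $\beta$ times the summed log-ratio, a single-sample estimate of the objective above.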

73.5 PPO for Robot Learning

For continuous-action robot control:
- Policy outputs a Gaussian: $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$
- Actions: joint velocities, end-effector deltas
- Reward: task-specific (reach target, grasp object)
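
With a Gaussian policy, the only change to the PPO ratio is how the log-probability is computed. Below is a minimal pure-Python sketch assuming a diagonal Gaussian with a fixed, state-independent `log_std` shared by the old and new policy. That is a simplification of the $\sigma_\theta(s)$ above, and the function names and numbers are hypothetical.

```python
import math


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        z = (a - mu) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def probability_ratio(action, new_mean, old_mean, log_std):
    """r(theta) = pi_new(a|s) / pi_old(a|s), computed in log space."""
    return math.exp(gaussian_log_prob(action, new_mean, log_std)
                    - gaussian_log_prob(action, old_mean, log_std))


# Hypothetical 2-D action (e.g. two joint velocities) sampled under the old policy.
action = [0.3, -0.1]
old_mean = [0.0, 0.0]
new_mean = [0.1, 0.0]      # the update moved the mean toward the action
log_std = [0.0, 0.0]       # sigma = 1 in both dimensions
ratio = probability_ratio(action, new_mean, old_mean, log_std)
print(f"ratio = {ratio:.4f}")
```

Because the new mean sits closer to the sampled action, the ratio comes out above 1. The clipped objective then treats it exactly as in the discrete case.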

Challenge: reward engineering is hard for robots. This motivates learning from demonstrations (Day 78) and diffusion-based policies (Day 81) instead.

Implementation (60 min)

PPO Implementation

import gymnasium as gym
import torch
import torch.nn as nn
import numpy as np

class PPOActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def get_action_and_value(self, state):
        logits = self.actor(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(state).squeeze(-1)

    def evaluate(self, state, action):
        logits = self.actor(state)
        dist = torch.distributions.Categorical(logits=logits)
        return dist.log_prob(action), dist.entropy(), self.critic(state).squeeze(-1)

def ppo_train(env_name="CartPole-v1", total_steps=100000,
              n_steps=2048, n_epochs=10, clip_eps=0.2,
              gamma=0.99, lam=0.95, lr=3e-4):

    env = gym.make(env_name)
    model = PPOActorCritic(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    state, _ = env.reset()
    ep_rewards, current_reward = [], 0

    for update in range(total_steps // n_steps):
        # --- Collect rollout ---
        states, actions, rewards, dones = [], [], [], []
        old_log_probs, values = [], []

        for _ in range(n_steps):
            state_t = torch.as_tensor(state, dtype=torch.float32)
            with torch.no_grad():
                action, log_prob, _, value = model.get_action_and_value(state_t)

            states.append(state_t)
            actions.append(action)
            old_log_probs.append(log_prob)
            values.append(value)

            state, reward, term, trunc, _ = env.step(action.item())
            rewards.append(reward)
            dones.append(float(term or trunc))
            current_reward += reward

            if term or trunc:
                ep_rewards.append(current_reward)
                current_reward = 0
                state, _ = env.reset()

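        # --- Compute GAE ---
        # Bootstrap from the critic's value of the state after the last step;
        # the (1 - dones[t]) mask drops it when the rollout ended on an episode boundary.
        # Note: time-limit truncation is treated like termination here (a simplification).
        with torch.no_grad():
            last_value = model.critic(torch.as_tensor(state, dtype=torch.float32)).item()
        vals = [v.item() for v in values] + [last_value]
        advantages, gae = [], 0
        for t in reversed(range(n_steps)):
            # TD error, then exponentially weighted sum of future TD errors
            delta = rewards[t] + gamma * vals[t+1] * (1-dones[t]) - vals[t]
            gae = delta + gamma * lam * (1-dones[t]) * gae
            advantages.insert(0, gae)

        advantages = torch.tensor(advantages, dtype=torch.float32)
        returns = advantages + torch.tensor(vals[:-1], dtype=torch.float32)
        # Normalize advantages per batch to stabilize the update
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)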

        states = torch.stack(states)
        actions = torch.stack(actions)
        old_log_probs = torch.stack(old_log_probs)

        # --- PPO update ---
        for epoch in range(n_epochs):
            new_log_probs, entropy, new_values = model.evaluate(states, actions)
            ratio = (new_log_probs - old_log_probs.detach()).exp()

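            # Clipped surrogate: pessimistic (min) of the raw and clipped terms
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1-clip_eps, 1+clip_eps) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()  # negated: the optimizer minimizes
            critic_loss = ((new_values - returns.detach())**2).mean()  # regress values onto GAE returns

            # Total loss: policy term + 0.5 * value term - 0.01 * entropy bonus (exploration)
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy.mean()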

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()

        if ep_rewards:
            print(f"Update {update}: avg reward = {np.mean(ep_rewards[-20:]):.1f}")

    return model, ep_rewards

Exercise (45 min)

Clip visualization: For fixed $\hat{A} > 0$, plot $L^\text{CLIP}$ vs $r(\theta)$. Identify the flat region where gradient = 0.
Epsilon ablation: Train PPO with $\epsilon \in \{0.05, 0.1, 0.2, 0.3, 0.5\}$. What breaks at extremes?
RLHF simulation: Implement a toy RLHF loop: use PPO to fine-tune a small character-level LM where the "reward model" is just string length. Train once without and once with the KL penalty, and compare how quickly the unpenalized run collapses to a single degenerate output (mode collapse).
Compare REINFORCE vs A2C vs PPO: Run all three on CartPole with the same number of environment steps. Plot learning curves.

Key Takeaways

PPO clips the probability ratio to prevent destructive policy updates
Same data, multiple epochs — PPO reuses collected trajectories (K epochs)
PPO = the algorithm behind RLHF — Day 33's missing implementation detail
Reward engineering is the bottleneck for robot RL → motivates imitation learning
GAE + PPO + clipping = the standard recipe for modern RL

Connection to the Thread

PPO completes the RL toolkit. But robot RL's Achilles heel is reward design: how do you specify "pick up the mug carefully" as a scalar reward? The answer is you don't — you show demonstrations instead. Before we get to imitation learning (Week 12), we need one more tool: diffusion models. Starting tomorrow, we learn to generate by denoising — a framework that will transform how robots produce actions.