Phase VII — Game Theory for Optimization & Robotics | Week 16 | 2.5 hours Guarantee bounded regret regardless of what the environment (or opponents) do — the foundation of online decision-making.
OKS task allocation is an online problem: tasks arrive over time, and robots must choose which to accept without knowing future arrivals. No-regret algorithms guarantee that over time, no robot would have been better off following any fixed strategy — critical for fair, efficient fleet-wide task distribution without central planning.
A player selects actions $a_1, a_2, \ldots, a_T$ from a set of $n$ actions. After each round, a loss vector $\ell^t \in [0,1]^n$ is revealed. The external regret after $T$ rounds is:
$$R_T = \sum_{t=1}^{T} \ell^t(a_t) - \min_{i \in [n]} \sum_{t=1}^{T} \ell^t(i)$$
This compares cumulative loss to the best fixed action in hindsight. An algorithm is no-regret (or Hannan consistent) if $R_T / T \to 0$.
Hart & Mas-Colell (2000): For each action $i$, define the cumulative regret for not having played $i$:
$$r_i^t = \sum_{\tau=1}^{t} \left[\ell^\tau(a_\tau) - \ell^\tau(i)\right]$$
Let $r_i^{t+} = \max(r_i^t, 0)$. The regret matching strategy at time $t+1$ is:
$$\sigma^{t+1}(i) = \frac{r_i^{t+}}{\sum_j r_j^{t+}}$$
If all regrets are non-positive, play uniformly. Theorem: Regret matching achieves $R_T = O(\sqrt{T \cdot n})$.
Initialize weights $w_i^1 = 1$ for each action $i$. At each round:
With learning rate $\eta = \sqrt{\ln n / T}$:
$$R_T \leq 2\sqrt{T \ln n}$$
This is optimal up to constants — no algorithm can achieve $R_T = o(\sqrt{T \log n})$ against adversarial losses.
Theorem (Freund & Schapire 1999): If both players in a zero-sum game use no-regret algorithms, their average strategies converge to a Nash equilibrium at rate $O(\sqrt{\log n / T})$.
Specifically, if $\bar{\sigma}_1^T$ and $\bar{\sigma}_2^T$ are the time-averaged mixed strategies:
$$\max_x x^\top A \bar{\sigma}_2^T - \bar{\sigma}_1^{T\top} A \min_y \bar{\sigma}_1^{T\top} A y \leq O\left(\sqrt{\frac{\ln n}{T}}\right)$$
"""No-Regret Learning: Regret Matching and Multiplicative Weights."""
import numpy as np
import matplotlib.pyplot as plt
def multiplicative_weights(losses: np.ndarray, eta: float = None) -> dict:
"""Multiplicative Weights Update (Hedge) algorithm.
Args:
losses: T x n array of loss vectors
eta: learning rate (auto-set if None)
Returns:
dict with strategies, cumulative regret, and regret bound
"""
T, n = losses.shape
if eta is None:
eta = np.sqrt(np.log(n) / T)
weights = np.ones(n)
strategies = np.zeros((T, n))
cumulative_loss = np.zeros(n)
player_loss = 0.0
regret_history = []
for t in range(T):
# Play proportional to weights
sigma = weights / weights.sum()
strategies[t] = sigma
# Incur loss
player_loss += sigma @ losses[t]
cumulative_loss += losses[t]
# Regret: player loss - best fixed action
regret = player_loss - cumulative_loss.min()
regret_history.append(regret)
# Update weights
weights *= np.exp(-eta * losses[t])
return {
'strategies': strategies,
'regret': np.array(regret_history),
'avg_strategy': strategies.mean(axis=0),
'bound': 2 * np.sqrt(T * np.log(n)),
}
def regret_matching(game_A: np.ndarray, T: int = 5000) -> dict:
"""Regret matching for a 2-player zero-sum game.
Both players use regret matching; returns average strategies.
"""
m, n = game_A.shape
cum_regret1 = np.zeros(m)
cum_regret2 = np.zeros(n)
sum_strategy1 = np.zeros(m)
sum_strategy2 = np.zeros(n)
avg1_hist, avg2_hist = [], []
for t in range(1, T + 1):
# Strategies from regret
pos1 = np.maximum(cum_regret1, 0)
sigma1 = pos1 / pos1.sum() if pos1.sum() > 0 else np.ones(m) / m
pos2 = np.maximum(cum_regret2, 0)
sigma2 = pos2 / pos2.sum() if pos2.sum() > 0 else np.ones(n) / n
sum_strategy1 += sigma1
sum_strategy2 += sigma2
# Compute regrets
# Player 1: could have played pure i instead
val1 = sigma1 @ game_A @ sigma2
for i in range(m):
cum_regret1[i] += game_A[i] @ sigma2 - val1
# Player 2: playing -A (zero-sum)
val2 = sigma1 @ (-game_A) @ sigma2
for j in range(n):
cum_regret2[j] += sigma1 @ (-game_A[:, j]) - val2
avg1_hist.append(sum_strategy1 / t)
avg2_hist.append(sum_strategy2 / t)
return {
'avg_strategy1': sum_strategy1 / T,
'avg_strategy2': sum_strategy2 / T,
'avg1_hist': np.array(avg1_hist),
'avg2_hist': np.array(avg2_hist),
}
# Demo 1: MWU against adversarial losses
np.random.seed(42)
T, n = 2000, 5
losses = np.random.rand(T, n)
result_mwu = multiplicative_weights(losses)
print("MWU against random losses:")
print(f" Final regret: {result_mwu['regret'][-1]:.2f}")
print(f" Theoretical bound: {result_mwu['bound']:.2f}")
print(f" Avg regret per round: {result_mwu['regret'][-1]/T:.4f}")
# Demo 2: Regret matching on Rock-Paper-Scissors
A_rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
result_rm = regret_matching(A_rps, T=10000)
print(f"\nRPS via regret matching:")
print(f" Avg strategy P1: {np.round(result_rm['avg_strategy1'], 4)}")
print(f" Avg strategy P2: {np.round(result_rm['avg_strategy2'], 4)}")
print(f" NE is (1/3, 1/3, 1/3) — error: "
f"{np.linalg.norm(result_rm['avg_strategy1'] - 1/3):.4f}")
# Plot regret growth vs bound
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(result_mwu['regret'], label='Actual regret')
T_range = np.arange(1, T + 1)
ax.plot(T_range, 2 * np.sqrt(T_range * np.log(n)), '--', label=r'$2\sqrt{T \ln n}$')
ax.set_xlabel('Round'); ax.set_ylabel('Cumulative Regret')
ax.set_title('MWU Regret vs Theoretical Bound')
ax.legend()
plt.tight_layout()
plt.savefig('no_regret_bound.png', dpi=100)
plt.show()
P1: Implement MWU for $n = 3$ actions with adversarial losses $\ell^t = e_{(t \mod 3)}$ (cycling). Verify $R_T = O(\sqrt{T})$.
P2: Both players in a $3 \times 3$ zero-sum game use MWU. Prove the average strategies form an $\varepsilon$-NE with $\varepsilon = O(\sqrt{\log 3 / T})$.
P3: Compare regret matching and MWU on the same zero-sum game. Which converges faster to NE?
E1: Implement internal regret minimization. Show that if both players minimize internal regret, average play converges to a correlated equilibrium.
E2: Derive the MWU regret bound $R_T \leq \eta \sum_t \|\ell^t\|_\infty^2 / 8 + \ln n / \eta$ using the potential function $\Phi^t = \ln(\sum_i w_i^t)$.
E3: Design an online task allocation system for 4 robots and 3 task types using MWU. Each robot independently learns which tasks to specialize in.