Phase II: Attention, Transformers & Scaling | Week 4 | 2.5 hours
"Every problem is a text-to-text problem, if you look at it right." (Colin Raffel)
T5 (Text-to-Text Transfer Transformer) proposed a radical unification: every NLP task is text-in, text-out.
Traditional approach (task-specific heads):
Input ──→ [Shared Encoder] ──→ [Classification Head] ──→ Label
                           └─→ [Regression Head]     ──→ Number
                           └─→ [Seq2Seq Head]        ──→ Text
T5 approach (one format):
"translate English to German: The house is wonderful" ──→ T5 ──→ "Das Haus ist wunderbar"
"summarize: <long article>" ──→ T5 ──→ "<summary>"
"sst2 sentence: I loved this movie" ──→ T5 ──→ "positive"
"cola sentence: The dog ran quickly" ──→ T5 ──→ "acceptable"
Why this matters: The task instruction is part of the input. The model learns what to do from the text prefix. This is a precursor to instruction-following LLMs (ChatGPT, etc.).
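The prefix convention can be sketched as a small formatting helper. This is a minimal illustration, not T5's actual preprocessing code: the function name `to_text_to_text` and the `TASK_PREFIXES` dictionary are assumptions made for this example, and the prefix strings are copied from the examples above.

```python
from typing import Tuple

# Hypothetical mapping from task name to text prefix; the prefix strings
# mirror the examples in this section.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "sst2": "sst2 sentence: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task: str, text: str, target: str) -> Tuple[str, str]:
    """Return (model_input, model_target), both as plain strings."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text, target


examples = [
    ("translate_en_de", "The house is wonderful", "Das Haus ist wunderbar"),
    ("sst2", "I loved this movie", "positive"),
    ("cola", "The dog ran quickly", "acceptable"),
]
pairs = [to_text_to_text(*example) for example in examples]
for model_input, model_target in pairs:
    print(f"{model_input!r} -> {model_target!r}")
```

Classification labels such as "positive" are ordinary target strings here, so the same loss and the same decoding loop serve every task.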
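T5 closely follows the original encoder-decoder transformer architecture (Vaswani et al., 2017), with small changes such as relative position biases in place of sinusoidal position encodings and a simplified layer normalization without a bias term: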
T5 Architecture:
  Input tokens                 Target tokens
       │                             │
       ▼                             ▼
  [Embedding]                   [Embedding]
       │                             │
       ▼                             ▼
┌──────────────┐            ┌──────────────────┐
│   Encoder    │            │     Decoder      │
│   Block ×N   │            │     Block ×N     │
│              │            │                  │
│  Self-Attn   │     ┌─────→│ Masked Self-Attn │
│ (bidirect.)  │     │      │ Cross-Attention  │←─ attends to encoder
│  FFN         │     │      │ FFN              │
│  LayerNorm   │─────┘      │ LayerNorm        │
└──────────────┘            └──────────────────┘
                                     │
                                     ▼
                            [Linear + Softmax]
                                     │
                                     ▼
                               Output tokens
Three types of attention in T5:
| Attention Type | Query | Key/Value | Masking |
|---|---|---|---|
| Encoder self-attention | Encoder | Encoder | None (bidirectional) |
| Decoder self-attention | Decoder | Decoder | Causal (can only see past) |
| Cross-attention | Decoder | Encoder | None (can see all input) |
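The masking column of the table can be made concrete with a few lines of plain Python. This is a sketch for intuition only: the function names are invented for this example, padding masks are ignored, and real implementations build these masks as tensors rather than nested lists.

```python
from typing import List

Mask = List[List[int]]  # mask[q][k] == 1 means query position q may attend to key position k


def encoder_self_attention_mask(enc_len: int) -> Mask:
    """Bidirectional: every encoder position sees every encoder position."""
    return [[1] * enc_len for _ in range(enc_len)]


def decoder_self_attention_mask(dec_len: int) -> Mask:
    """Causal: decoder position q sees positions 0..q only."""
    return [[1 if k <= q else 0 for k in range(dec_len)] for q in range(dec_len)]


def cross_attention_mask(dec_len: int, enc_len: int) -> Mask:
    """Unmasked: every decoder position sees the whole encoded input."""
    return [[1] * enc_len for _ in range(dec_len)]


for row in decoder_self_attention_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Only the decoder self-attention mask is triangular. The cross-attention mask is rectangular because its queries come from the decoder and its keys from the encoder, so the two lengths generally differ.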
This is the most important comparison in modern NLP:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ENCODER-ONLY (BERT)              DECODER-ONLY (GPT)            │
│                                                                 │
│  Input: "The [MASK] sat"          Input: "The cat"              │
│         ↓                                ↓                      │
│  ┌──────────────┐                 ┌──────────────┐              │
│  │ Bidirectional│                 │ Causal (L→R) │              │
│  │  Self-Attn   │                 │  Self-Attn   │              │
│  └──────────────┘                 └──────────────┘              │
│         ↓                                ↓                      │
│  Output: "cat"                    Output: "sat"                 │
│  (fill in blank)                  (predict next)                │
│                                                                 │
│  ENCODER-DECODER (T5)                                           │
│                                                                 │
│  Input: "translate: The cat sat" ──→ [Encoder] ──→ [Decoder]    │
│                                                       ↓         │
│                                            Output: "Le chat..." │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
| Aspect | BERT (Enc-only) | GPT (Dec-only) | T5 (Enc-Dec) |
|---|---|---|---|
| Pre-training | Masked LM | Autoregressive LM | Span corruption |
| Bidirectional context | ✅ Yes | ❌ No (causal) | ✅ Encoder, ❌ Decoder |
| Generation | ❌ Poor | ✅ Excellent | ✅ Good |
| Understanding | ✅ Excellent | ⚠️ Good | ✅ Excellent |
| Few-shot learning | ❌ Needs fine-tuning | ✅ In-context | ⚠️ Mixed |
| Parameters (for same quality) | Smallest | Largest | Medium |
| Won in practice | NLU tasks (2018-2020) | Everything (2022+) | Seq2seq tasks |
GPT-style models dominate today. Why?
T5-style models can still be the better fit for specific tasks:
- Translation: The encoder processes the full source sentence bidirectionally before the decoder generates the target. Every source token's representation can draw on context from both sides, whereas a causal decoder encodes each prompt token using only the tokens to its left.
- Summarization: Full bidirectional understanding of the source document is crucial for selecting what to include.
- Structured prediction: When the output is much shorter than the input and structurally different from it.
- Conditional generation with complex inputs: When you need to deeply process the "condition" (image, document, etc.) before generating.
The VLA connection: Many VLAs use encoder-decoder–like structures! The vision encoder processes the image bidirectionally, and the action decoder generates actions autoregressively. This is essentially a T5 architecture where the "encoder" is a ViT and the "decoder" generates actions instead of text.
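One common way for a decoder to "generate actions instead of text" is to discretize each continuous action dimension into a fixed number of bins and treat the bin indices as tokens. The sketch below assumes that scheme. The function names, the [-1, 1] action range, and the 256-bin default are illustrative choices for this example, not details of any specific VLA discussed here.

```python
from typing import List


def action_to_tokens(action: List[float], low: float = -1.0,
                     high: float = 1.0, num_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to an integer bin index (assumed scheme)."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)            # keep the value inside the range
        scaled = (clipped - low) / (high - low)         # rescale to 0.0..1.0
        tokens.append(int(scaled * (num_bins - 1) + 0.5))  # round to the nearest bin
    return tokens


def tokens_to_action(tokens: List[int], low: float = -1.0,
                     high: float = 1.0, num_bins: int = 256) -> List[float]:
    """Invert the mapping: bin index back to the value that bin represents."""
    return [low + t / (num_bins - 1) * (high - low) for t in tokens]


action = [0.25, -0.5, 2.0]            # the last value lies outside the range and is clipped
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
print(tokens)
print([round(v, 3) for v in recovered])
```

Once actions are integer tokens, the decoder can emit them one at a time with the same autoregressive loop it would use for text, at the cost of a small quantization error per dimension.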
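T5's pre-training objective is neither plain left-to-right language modeling nor BERT-style masked LM. Instead, it uses span corruption: contiguous spans of the input are dropped, and the decoder reconstructs them: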
Original: "The quick brown fox jumps over the lazy dog"
Corrupted: "The <X> brown fox <Y> the lazy dog"
↓
Target: "<X> quick <Y> jumps over"
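<X>, <Y>, etc. are sentinel tokens: each one replaces a dropped span in the input and marks where that span's text begins in the target. A mean span length of 3 tokens and a corruption rate of 15% worked best in T5's ablation study.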
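The example above can be reproduced with a short function. This is a simplified sketch, not T5's preprocessing code: the spans are fixed by hand rather than sampled at random, the text is split on whitespace instead of being tokenized with SentencePiece, and the name `span_corrupt` is invented for this example. The sentinel names follow the example (Hugging Face T5 checkpoints call them `<extra_id_0>`, `<extra_id_1>`, and so on), and the final closing sentinel that the T5 paper appends to the target is omitted to match the example.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]],
                 sentinels: List[str]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel and collect the spans in the target.

    `spans` must be sorted and non-overlapping.
    """
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for (start, end), sentinel in zip(spans, sentinels):
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel stands in for the whole span
        target.append(sentinel)                 # the same sentinel introduces the span in the target
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    return corrupted, target


tokens = "The quick brown fox jumps over the lazy dog".split()
corrupted, target = span_corrupt(tokens, [(1, 2), (4, 6)], ["<X>", "<Y>"])
print(" ".join(corrupted))  # The <X> brown fox <Y> the lazy dog
print(" ".join(target))     # <X> quick <Y> jumps over
```

Because each span collapses to a single sentinel, both the corrupted input and the target are shorter than the original text.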
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
# Load T5-small (60M params — fast for experimentation)
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()
def summarize_with_t5(text: str, max_length: int = 150) -> str:
    """Summarize text using T5's text-to-text format."""
    # T5 expects the task prefix
    input_text = f"summarize: {text}"
    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=max_length,
            num_beams=4,  # beam search for quality
            length_penalty=2.0,  # encourage longer summaries
            early_stopping=True,
            no_repeat_ngram_size=3,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Example
article = """
Scaling laws for neural language models show that performance improves
predictably as a power law with increased compute, data, and parameters.
The Chinchilla paper demonstrated that most large language models were
significantly undertrained, recommending approximately 20 training tokens
per parameter for compute-optimal training. This finding reshaped how
the industry approaches model training, leading to models like LLaMA
that deliberately over-train smaller models on more data to optimize
inference-time efficiency rather than training-time efficiency.
"""
summary = summarize_with_t5(article)
print(f"Summary: {summary}")
def translate_with_t5(text: str, source: str = "English", target: str = "German") -> str:
    """Translate text using T5."""
    input_text = f"translate {source} to {target}: {text}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=512,
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Examples
sentences = [
    "The cat sits on the mat.",
    "Scaling laws predict model performance.",
    "Robots must learn from experience.",
]
for sent in sentences:
    translation = translate_with_t5(sent)
    print(f" EN: {sent}")
    print(f" DE: {translation}\n")
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def summarize_with_gpt2(text: str, max_new_tokens: int = 100) -> str:
    """Attempt summarization with GPT-2 using prompting."""
    gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
    gpt2_model.eval()
    # GPT-2 doesn't have a task prefix, so use prompting
    prompt = f"Article: {text}\n\nSummary:"
    input_ids = gpt2_tok.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = gpt2_model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            no_repeat_ngram_size=3,
            pad_token_id=gpt2_tok.eos_token_id,  # GPT-2 has no pad token
        )
    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return gpt2_tok.decode(new_tokens, skip_special_tokens=True).strip()
# Compare on the same article
print("=== T5 Summary ===")
print(summarize_with_t5(article))
print("\n=== GPT-2 Summary ===")
print(summarize_with_gpt2(article))
def visualize_cross_attention(model, tokenizer, input_text: str, target_text: str):
    """Visualize which input tokens the decoder attends to."""
    import matplotlib.pyplot as plt
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    target_ids = tokenizer.encode(target_text, return_tensors="pt")
    with torch.no_grad():
        # Pass the target as labels: the model then builds decoder_input_ids by
        # shifting the labels right and prepending the decoder start token, so
        # row i of the attention map is the step that predicts target token i.
        outputs = model(
            input_ids=input_ids,
            labels=target_ids,
            output_attentions=True,
        )
    # Cross-attention from last decoder layer, first head
    cross_attn = outputs.cross_attentions[-1][0, 0].numpy()  # (dec_len, enc_len)
    # Token labels
    input_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    target_tokens = tokenizer.convert_ids_to_tokens(target_ids[0])
    fig, ax = plt.subplots(figsize=(12, 6))
    im = ax.imshow(cross_attn, cmap="Blues", aspect="auto")
    ax.set_xticks(range(len(input_tokens)))
    ax.set_xticklabels(input_tokens, rotation=45, ha="right", fontsize=8)
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens, fontsize=8)
    ax.set_xlabel("Encoder (Input)")
    ax.set_ylabel("Decoder (Output)")
    ax.set_title("Cross-Attention: What does the decoder look at?")
    plt.colorbar(im)
    plt.tight_layout()
    plt.savefig("cross_attention.png", dpi=150)
    plt.show()

# visualize_cross_attention(model, tokenizer,
#     "translate English to German: The cat sits on the mat",
#     "Die Katze sitzt auf der Matte")
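A heatmap is easier to interpret alongside a plain list of which input token each output token attends to most. The helper below is a hypothetical addition (`strongest_alignment` is not part of any library) that works on a nested list of attention weights. The weights in the demo are made up for illustration. With a real model, the `cross_attn` array computed inside `visualize_cross_attention` could be passed in as `cross_attn.tolist()`.

```python
from typing import List, Tuple


def strongest_alignment(cross_attn: List[List[float]],
                        input_tokens: List[str],
                        target_tokens: List[str]) -> List[Tuple[str, str, float]]:
    """For each decoder row, return (target token, most-attended input token, weight)."""
    result = []
    for row, tgt in zip(cross_attn, target_tokens):
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest weight
        result.append((tgt, input_tokens[best], row[best]))
    return result


# Toy weights, invented for illustration; each row sums to 1.
toy_attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
alignment = strongest_alignment(toy_attn, ["The", "cat", "sits"], ["Die", "Katze", "sitzt"])
for tgt, src, weight in alignment:
    print(f"{tgt:>6} <- {src} ({weight:.1f})")
```

A single head in a single layer gives only a partial view, so the result is best read as a hint about what the decoder uses, not as a word alignment.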
Fill in this table from your understanding (not from looking it up):
| Feature | GPT (Decoder) | BERT (Encoder) | T5 (Enc-Dec) |
|---|---|---|---|
| Pre-training objective | ? | ? | ? |
| Bidirectional input context | ? | ? | ? |
| Can generate text | ? | ? | ? |
| Few-shot capable | ? | ? | ? |
| Parameter efficiency | ? | ? | ? |
| Best for classification | ? | ? | ? |
| Best for translation | ? | ? | ? |
| Best for open-ended generation | ? | ? | ? |
List 3 tasks where an encoder-decoder architecture is strictly better than decoder-only:
Task: ___
Why: ___
Task: ___
Why: ___
Task: ___
Why: ___
Hint: Think about tasks where the output is much shorter than the input, where bidirectional understanding of the input is critical, or where the output has different structure than the input.
Answer: "Is a VLA more like GPT, BERT, or T5? Why?"
Consider:
- The image encoder processes visual input bidirectionally (like BERT/T5 encoder)
- The action decoder generates actions autoregressively (like GPT/T5 decoder)
- Cross-attention connects vision to action (like T5 cross-attention)
Draw the VLA as a T5-like architecture:
Image ──→ [Vision Encoder] ──→ cross-attn ──→ [Action Decoder] ──→ Actions
                                   ↑
Text  ──→ [Language Model] ────────┘
T5's encoder-decoder architecture is a useful blueprint for reading VLAs. When a model such as RT-2 encodes an image with a ViT and emits action tokens from a language-model decoder, the overall shape is T5-like: encode the observation, then decode, with robot actions taking the place of text. Understanding when and why encoder-decoder works better than decoder-only helps you reason about VLA design choices.