Phase II — Attention, Transformers & Scaling | Week 4 | 2.5 hours "How many Rs in strawberry? If you can't answer that, your tokenizer is showing." — @karpathy
Models don't see text. They see integer sequences. How you convert text → integers has profound consequences:
Text: "unhappiness"
Character: ['u','n','h','a','p','p','i','n','e','s','s'] → 11 tokens
Word: ['unhappiness'] → 1 token (but vocab=∞)
Subword: ['un', 'happiness'] → 2 tokens ← sweet spot
BPE: ['un', 'happ', 'iness'] → 3 tokens (depends on training)
| Level | Vocab Size | Seq Length | OOV? | Morphology |
|---|---|---|---|---|
| Character | ~256 | Very long | Never | Must learn from scratch |
| Word | 100K+ | Short | Yes, common | Captures full words |
| Subword (BPE) | 32K–100K | Medium | Rare | Learns morphemes |
BPE is the dominant tokenization algorithm. It's a compression algorithm repurposed for NLP.
Algorithm:
Corpus: "low low low low low lowest lowest newer newer newer wider wider"
Step 0 — Character vocabulary:
{'l','o','w','e','s','t','n','r','i','d',' '}
Step 0 — Tokenized corpus:
l o w l o w l o w l o w l o w l o w e s t l o w e s t
n e w e r n e w e r n e w e r w i d e r w i d e r
Step 1 — Most frequent pair: ('l', 'o') → count: 7
Merge 'l'+'o' → 'lo'
lo w lo w lo w lo w lo w lo w e s t lo w e s t
n e w e r n e w e r n e w e r w i d e r w i d e r
Step 2 — Most frequent pair: ('lo', 'w') → count: 7
Merge 'lo'+'w' → 'low'
low low low low low low e s t low e s t
n e w e r n e w e r n e w e r w i d e r w i d e r
Step 3 — Most frequent pair: ('e', 'r') → count: 5
Merge 'e'+'r' → 'er'
low low low low low low e s t low e s t
n e w er n e w er n e w er w i d er w i d er
Step 4 — Most frequent pair: ('e', 'w') → count: 3
Merge 'e'+'w' → 'ew'
...and so on until vocab size reached
Both are subword tokenization but differ in the merge criterion:
BPE: Merges the most frequent pair.
WordPiece: Merges the pair that maximizes the likelihood of the training data: $$\text{score}(a, b) = \frac{\text{freq}(ab)}{\text{freq}(a) \times \text{freq}(b)}$$
This favors merging rare subwords that appear together, rather than just common pairs.
In practice: The difference is minor. BPE is more common in modern models.
Problem: Standard BPE depends on pre-tokenization (splitting on spaces). This fails for languages without spaces (Chinese, Japanese) and creates language bias.
SentencePiece (Kudo & Richardson, 2018):
- Treats the input as a raw stream of characters (including spaces as ▁)
- Language-agnostic — no pre-tokenization needed
- Used by LLaMA, T5, mBART
Byte-Level BPE (GPT-2/3/4, tiktoken):
- Start with 256 byte tokens (UTF-8 bytes)
- Build BPE merges on bytes, not characters
- Can represent ANY text (including code, emoji, non-Latin scripts)
- tiktoken is OpenAI's fast implementation
# tiktoken example
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
text = "Hello, world! 🌍 こんにちは"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded pieces: {[enc.decode([t]) for t in tokens]}")
# ['Hello', ',', ' world', '!', ' 🌍', ' ', 'こん', 'にち', 'は']
Tokenization creates blindness — the model can't reason about sub-token structure:
"How many Rs in strawberry?"
GPT-4 tokenization:
['How', ' many', ' Rs', ' in', ' str', 'aw', 'berry', '?']
↑
'strawberry' is split into 'str' + 'aw' + 'berry'
The model sees 3 tokens, not individual characters
It cannot count letters because it never sees them!
Other artifacts:
" Hello" ≠ "Hello" (leading space is a separate token)
"1+1=2" tokenized as: ['1', '+', '1', '=', '2'] ← fine
"12345" tokenized as: ['123', '45'] ← number split!
Python code:
" if x:" → [' ', 'if', ' x', ':']
vs
" if x:" → [' ', 'if', ' x', ':']
Different indentation = different first token!
Why this matters for LLMs: - Arithmetic errors (numbers split unpredictably) - Spelling/character-level tasks fail - Multilingual efficiency varies wildly (English: ~1 token/word, Japanese: ~3 tokens/character) - Code formatting sensitivity
| Vocab Size | Sequence Length | Embedding Params | Quality |
|---|---|---|---|
| 256 (bytes) | Very long | 256 × d | Must learn everything |
| 32K (GPT-2) | Medium | 32K × d (~25M) | Good general balance |
| 50K (BERT) | Shorter | 50K × d (~38M) | Better word coverage |
| 100K (GPT-4) | Shortest | 100K × d (~77M) | Best coverage, most params |
Rule of thumb: Larger vocab → shorter sequences → faster training, but more embedding parameters and harder to train rare tokens.
"""
Minimal BPE implementation.
Based on Karpathy's minbpe: https://github.com/karpathy/minbpe
"""
def get_pair_counts(token_ids):
"""Count frequency of adjacent token pairs."""
counts = {}
for a, b in zip(token_ids, token_ids[1:]):
counts[(a, b)] = counts.get((a, b), 0) + 1
return counts
def merge_pair(token_ids, pair, new_id):
"""Replace all occurrences of pair with new_id."""
result = []
i = 0
while i < len(token_ids):
if i < len(token_ids) - 1 and (token_ids[i], token_ids[i+1]) == pair:
result.append(new_id)
i += 2
else:
result.append(token_ids[i])
i += 1
return result
class MinBPE:
"""Minimal BPE tokenizer."""
def __init__(self):
self.merges = {} # (int, int) → int
self.vocab = {} # int → bytes
def train(self, text, vocab_size=276):
"""Train BPE on text. Default: 276 = 256 bytes + 20 merges."""
assert vocab_size >= 256, "Vocab must include all bytes"
n_merges = vocab_size - 256
# Convert text to bytes
token_ids = list(text.encode("utf-8"))
print(f"Training BPE: {len(token_ids)} bytes → {n_merges} merges")
# Build initial vocab: byte → byte
self.vocab = {i: bytes([i]) for i in range(256)}
for i in range(n_merges):
counts = get_pair_counts(token_ids)
if not counts:
break
# Find most frequent pair
best_pair = max(counts, key=counts.get)
new_id = 256 + i
# Merge
token_ids = merge_pair(token_ids, best_pair, new_id)
# Record
self.merges[best_pair] = new_id
self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
if i < 10 or i % 50 == 0:
piece = self.vocab[new_id].decode("utf-8", errors="replace")
print(f" Merge {i}: {best_pair} → {new_id} "
f"('{piece}', count={counts[best_pair]})")
print(f"Final: {len(token_ids)} tokens "
f"(compression: {len(text.encode('utf-8'))/len(token_ids):.1f}x)")
return self
def encode(self, text):
"""Encode text to token IDs."""
token_ids = list(text.encode("utf-8"))
while len(token_ids) >= 2:
counts = get_pair_counts(token_ids)
# Find the pair with the lowest merge index (earliest learned)
best = min(
counts,
key=lambda p: self.merges.get(p, float('inf'))
)
if best not in self.merges:
break
token_ids = merge_pair(token_ids, best, self.merges[best])
return token_ids
def decode(self, token_ids):
"""Decode token IDs back to text."""
raw = b"".join(self.vocab[t] for t in token_ids)
return raw.decode("utf-8", errors="replace")
# Train on a sample corpus
corpus = """
The transformer architecture has revolutionized natural language processing.
Attention is all you need. Self-attention allows each token to attend to every
other token in the sequence. Multi-head attention splits the attention into
multiple heads, each learning different patterns. The transformer uses
positional encodings to inject sequence order information.
""" * 100 # Repeat for more data
bpe = MinBPE()
bpe.train(corpus, vocab_size=356) # 256 bytes + 100 merges
# Test encoding
test = "The transformer uses attention mechanisms."
encoded = bpe.encode(test)
decoded = bpe.decode(encoded)
print(f"\nOriginal: '{test}'")
print(f"Encoded: {encoded}")
print(f"N tokens: {len(encoded)} (vs {len(test)} chars)")
print(f"Decoded: '{decoded}'")
print(f"Roundtrip: {test == decoded}")
# Show the learned vocabulary pieces
print("\nLearned merges (first 20):")
for (a, b), new_id in list(bpe.merges.items())[:20]:
piece = bpe.vocab[new_id].decode("utf-8", errors="replace")
print(f" {a} + {b} → {new_id} = '{piece}'")
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
test_texts = [
"Hello, world!",
"The quick brown fox jumps over the lazy dog.",
"unhappiness",
"How many Rs in strawberry?",
"def fibonacci(n):\n if n <= 1:\n return n",
"こんにちは世界", # "Hello world" in Japanese
"🎉🎊🎈", # Emoji
"3.14159265358979323846", # Pi digits
]
print(f"{'Text':<50} {'tiktoken':>10} {'BPE':>10} {'Chars':>10}")
print("-" * 85)
for text in test_texts:
n_tiktoken = len(enc.encode(text))
n_bpe = len(bpe.encode(text))
n_chars = len(text)
display = text[:47] + "..." if len(text) > 50 else text
print(f"{display:<50} {n_tiktoken:>10} {n_bpe:>10} {n_chars:>10}")
# Inspect tokenization of tricky cases
for text in ["strawberry", "How many Rs in strawberry?"]:
tokens = enc.encode(text)
pieces = [enc.decode([t]) for t in tokens]
print(f"\ntiktoken({text!r}):")
print(f" Pieces: {pieces}")
print(f" IDs: {tokens}")
Train your MinBPE on different corpus types and measure compression:
corpora = {
"english": "The quick brown fox..." * 1000,
"python_code": "def func(x):\n return x * 2\n" * 1000,
"numbers": "3141592653589793238462643383279" * 100,
"repetitive": "aaabbbcccaaabbbccc" * 1000,
}
for name, text in corpora.items():
bpe = MinBPE()
bpe.train(text, vocab_size=356)
encoded = bpe.encode(text)
ratio = len(text.encode('utf-8')) / len(encoded)
print(f"{name}: {ratio:.1f}x compression")
Which corpus compresses best? Why?
Test multiple tokenizers on character-level tasks:
# Task: Count occurrences of 'r' in these words
words = ["strawberry", "mirror", "raspberry", "error", "corridor"]
for word in words:
tokens_gpt4 = enc.encode(word)
pieces = [enc.decode([t]) for t in tokens_gpt4]
actual_count = word.count('r')
print(f"'{word}': actual r-count={actual_count}, "
f"tokens={pieces}, n_tokens={len(tokens_gpt4)}")
# Can the model count 'r's if it sees these pieces?
# Does any piece split an 'r' from its context?
Compare token count for the same meaning in different languages:
sentences = {
"English": "Artificial intelligence will transform the world.",
"Japanese": "人工知能は世界を変革するでしょう。",
"Chinese": "人工智能将改变世界。",
"Russian": "Искусственный интеллект изменит мир.",
"Arabic": "سيغير الذكاء الاصطناعي العالم.",
}
for lang, text in sentences.items():
tokens = enc.encode(text)
chars = len(text)
ratio = len(tokens) / len(text.split()) if ' ' in text else len(tokens)
print(f"{lang:10s}: {len(tokens):3d} tokens, {chars:3d} chars, "
f"tokens/char={len(tokens)/chars:.2f}")
Why is this ratio so different across languages? What are the implications for multilingual LLMs?
Tokenization is the invisible foundation under everything you've built. Every attention matrix, every embedding, every generation step operates on tokens — not characters or words. Understanding tokenization explains many "surprising" LLM failures (arithmetic, spelling, multilingual inconsistency). Tomorrow, you'll build GPT from scratch with nanoGPT, and you'll see tokenization in action as the first step of the entire pipeline.