Phase IV · Week 9 · Day 59 of 70 · 2.5 hours
"We borrowed the best idea from operating systems — virtual memory — and applied it to attention. The result was a 24× throughput improvement."
| ← Previous | Next → | 📅 Week | 🔷 Phase | 📚 Curriculum |
|---|---|---|---|---|
| Day 58: KV Cache Optimization | Day 60: Speculative Decoding | Week 9: LLM Serving Systems | Phase IV: Inference & Deployment | ML Compilers |
Before vLLM, LLM serving systems wasted 60-80% of GPU memory on internal fragmentation — pre-allocated KV cache blocks that were reserved but unused. A 70B model on 4× A100s might only serve 8 concurrent requests despite having enough raw memory for 40+. PagedAttention solved this by treating KV cache memory like an OS treats RAM: paged, dynamically allocated, and shared when possible. vLLM became the de facto standard for LLM serving, and PagedAttention's ideas have been adopted by TensorRT-LLM, DeepSpeed, and SGLang. Understanding its internals is essential for anyone deploying LLMs at scale.
Traditional KV cache allocation pre-reserves a contiguous memory block for each request's maximum possible sequence length:
Pre-Allocated KV Cache (Traditional)
═══════════════════════════════════════════════════════════════
GPU Memory Pool (contiguous allocation):
┌──────────────────────────────────────────────────────┐
│ Request A (max 2048) │ Request B (max 2048) │
│ [████████░░░░░░░░░░░░░] │ [█████████████░░░░░░░░░░] │
│ actual=512 wasted=1536 actual=832 wasted=1216 │
├──────────────────────────────────────────────────────┤
│ Request C (max 2048) │ ░░░░░ UNUSABLE ░░░░░░░░░░ │
│ [██░░░░░░░░░░░░░░░░░░] │ (too fragmented for new │
│ actual=128 wasted=1920│ request requiring 2048) │
└──────────────────────────────────────────────────────┘
Total allocated: 6144 slots
Actually used: 1472 slots (24%)
Wasted: 4672 slots (76%) ← catastrophic waste!
Sources of waste:
─────────────────
1. Internal fragmentation: reserved slots beyond actual length
2. External fragmentation: gaps between allocations
3. Pre-allocation: must reserve for max_seq_len upfront
PagedAttention maps the KV cache into fixed-size blocks (pages), indexed via per-request block tables:
PagedAttention Architecture
═══════════════════════════════════════════════════════════════
Physical KV blocks (fixed-size, e.g., 16 tokens each):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ P0 │ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │
│ Req │ Req │ Req │ FREE│ Req │ Req │ FREE│ Req │
│ A │ B │ A │ │ C │ B │ │ A │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Block Tables (logical → physical mapping):
┌──────────────────────────────┐
│ Request A: [P0, P2, P7] │ 3 blocks = 48 tokens allocated
│ Request B: [P1, P5] │ 2 blocks = 32 tokens allocated
│ Request C: [P4] │ 1 block = 16 tokens allocated
│ Free list: [P3, P6] │ 2 blocks available
└──────────────────────────────┘
Key properties:
───────────────
✓ No contiguous allocation needed (blocks are non-contiguous)
✓ No pre-allocation to max length (grow block by block)
✓ Internal fragmentation < 1 block per request
✓ No external fragmentation (any free block fits any request)
✓ Memory utilization: >95% (vs ~24% in traditional)
The analogy to OS virtual memory is precise:
| OS Concept | PagedAttention Analog |
|---|---|
| Virtual address space | Sequence positions (0, 1, 2, ...) |
| Physical pages (4KB) | KV blocks (16 tokens × head_dim) |
| Page table | Block table per request |
| Page fault → allocate | New token → allocate next block |
| Free page list | Free block list |
| Copy-on-write (fork) | Shared prefix blocks |
PagedAttention enables copy-on-write for requests sharing a common prefix (e.g., system prompt):
Copy-on-Write for Shared Prefixes
═══════════════════════════════════════════════════════════════
System prompt: "You are a helpful assistant. Answer concisely."
(3 blocks)
Without CoW:
──────────────
Req A: [PA₀][PA₁][PA₂][PA₃][PA₄] 15 blocks total
Req B: [PB₀][PB₁][PB₂][PB₃] for 3 requests
Req C: [PC₀][PC₁][PC₂][PC₃][PC₄][PC₅]
^^^^ ^^^^ ^^^^
identical prefix (duplicated 3×)
With CoW:
──────────
Shared: [PS₀][PS₁][PS₂] 9 blocks total
│ (saved 6 blocks = 40%)
┌──────────┼──────────┐
▼ ▼ ▼
Req A: [PA₃][PA₄] Req B: [PB₃] Req C: [PC₃][PC₄][PC₅]
Block table with ref counts:
PS₀ (refcount=3), PS₁ (refcount=3), PS₂ (refcount=3)
PA₃ (refcount=1), PA₄ (refcount=1), ...
If Req A modifies a shared block → copy it first, then modify
Prefix caching extends this to cache common prefixes across requests, enabling instant prefill for repeated system prompts.
vLLM System Architecture
═══════════════════════════════════════════════════════════════
┌─────────────┐ ┌──────────────────────────────────────┐
│ API │ │ vLLM Engine │
│ Server │───▶│ │
│ (FastAPI) │ │ ┌──────────┐ ┌──────────────────┐ │
└─────────────┘ │ │Scheduler │ │ Block Manager │ │
│ │ │──│ │ │
│ │ Waiting │ │ Allocate/Free │ │
│ │ Running │ │ blocks per req │ │
│ │ Swapped │ │ CoW ref counting │ │
│ └──────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Model Runner │ │
│ │ │ │
│ │ Prepare inputs │ │
│ │ Run model forward │ │
│ │ Sample tokens │ │
│ │ PagedAttention │ │
│ │ kernel calls │ │
│ └──────────────────┘ │
└──────────────────────────────────────┘
Scheduler states per request:
─────────────────────────────
WAITING → queued, no GPU blocks allocated yet
RUNNING → has GPU blocks, actively generating
SWAPPED → GPU blocks moved to CPU (preemption)
FINISHED → completed, blocks freed
Each iteration of the vLLM engine runs one "step":
# Simplified vLLM scheduler loop (pseudocode)
class Scheduler:
def __init__(self, block_manager, max_num_seqs):
self.waiting = [] # requests not yet started
self.running = [] # actively generating
self.swapped = [] # preempted to CPU
def schedule(self):
"""Decide which requests to run this step."""
scheduled = []
# 1. Try to swap in previously preempted requests
while self.swapped and self.block_manager.can_swap_in(self.swapped[0]):
req = self.swapped.pop(0)
self.block_manager.swap_in(req)
scheduled.append(req)
# 2. Continue running existing requests
for req in self.running:
if self.block_manager.can_append_token(req):
scheduled.append(req)
else:
# No space! Preempt lowest-priority request
victim = self._find_preempt_victim(scheduled)
self.block_manager.swap_out(victim)
self.swapped.append(victim)
scheduled.remove(victim)
scheduled.append(req)
# 3. Admit new requests from waiting queue
while self.waiting and self.block_manager.can_allocate(self.waiting[0]):
req = self.waiting.pop(0)
self.block_manager.allocate(req)
scheduled.append(req)
return scheduled # These all run in one batched forward pass
vLLM implements iteration-level continuous batching — the batch composition changes every single decode step:
Continuous Batching Timeline
═══════════════════════════════════════════════════════════════
Step 1: Batch = [A_decode, B_decode, C_prefill]
▲ ▲
ongoing decodes new request C enters
Step 2: Batch = [A_decode, B_decode, C_decode]
C now decoding (prefill done)
Step 3: Batch = [A_decode, C_decode, D_prefill]
▲ ▲
B finished! D enters immediately
(blocks freed) (no waiting!)
Step 4: Batch = [A_decode, C_decode, D_decode, E_prefill]
▲
E enters too
Result:
─────────
✓ GPU never idles — empty slots filled instantly
✓ No head-of-line blocking — short requests don't wait for long ones
✓ Memory recycled immediately when requests finish
Key scheduling decision: chunked prefill. Long prefill requests can monopolize the GPU for hundreds of milliseconds, stalling all decode requests. vLLM splits prefill into chunks interleaved with decode steps:
Chunked Prefill (vLLM)
═══════════════════════════════════════════════════════════════
Without chunking:
|←──── C prefill (2048 tok, 200ms) ────→| Decode A,B stalled!
With chunking (chunk=256):
|C₁|A,B|C₂|A,B|C₃|A,B|C₄|A,B|C₅|A,B|C₆|A,B|C₇|A,B|C₈|A,B|
▲ ▲
256 tok prefill chunk done
Decode latency for A,B: bounded by chunk processing time
TTFT for C: slightly higher, but decode requests stay responsive
# vLLM benchmark script — compare with HuggingFace and TGI
# Requires: pip install vllm
from vllm import LLM, SamplingParams
import time
def benchmark_vllm(model_name="meta-llama/Llama-2-7b-hf",
num_prompts=100,
input_len=512,
output_len=128):
"""Benchmark vLLM throughput and latency."""
# Initialize vLLM engine
llm = LLM(
model=model_name,
tensor_parallel_size=1,
gpu_memory_utilization=0.9, # Use 90% of GPU for KV cache
max_model_len=4096,
block_size=16, # PagedAttention block size
enable_prefix_caching=True, # Enable prefix sharing
)
sampling_params = SamplingParams(
temperature=0.0, # greedy for deterministic benchmarks
max_tokens=output_len,
)
# Generate synthetic prompts
prompts = [f"Write a detailed essay about topic number {i}. "
f"{'padding ' * (input_len // 2)}" for i in range(num_prompts)]
# Warm-up
llm.generate(prompts[:2], sampling_params)
# Benchmark
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start
total_input_tokens = sum(len(o.prompt_token_ids) for o in outputs)
total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Model: {model_name}")
print(f"Requests: {num_prompts}")
print(f"Total time: {elapsed:.2f} sec")
print(f"Throughput: {num_prompts/elapsed:.1f} req/sec")
print(f"Input tok/sec: {total_input_tokens/elapsed:.0f}")
print(f"Output tok/sec: {total_output_tokens/elapsed:.0f}")
print(f"Total tok/sec: {(total_input_tokens+total_output_tokens)/elapsed:.0f}")
# Per-request latency distribution
latencies = [o.metrics.finished_time - o.metrics.arrival_time
for o in outputs if o.metrics]
if latencies:
import statistics
print(f"\nLatency (p50): {statistics.median(latencies)*1000:.0f} ms")
print(f"Latency (p99): {sorted(latencies)[int(0.99*len(latencies))]*1000:.0f} ms")
# Uncomment to run:
# benchmark_vllm()
| System | Throughput (tok/s) | Memory Util. | Continuous Batch | PagedAttention |
|---|---|---|---|---|
HuggingFace generate() |
~500 | ~30% | No | No |
| Text Generation Inference | ~2,000 | ~60% | Yes | No |
| vLLM | ~5,000 | ~95% | Yes | Yes |
| TensorRT-LLM | ~6,000 | ~93% | Yes | Paged KV |
| SGLang | ~5,500 | ~95% | Yes | RadixAttention |
Approximate numbers for LLaMA-2 7B, A100 80GB, batch of 64, 512 input + 128 output tokens.
class MiniBlockManager:
"""Simplified block manager to understand PagedAttention allocation."""
def __init__(self, num_blocks: int, block_size: int):
self.block_size = block_size
self.free_blocks = list(range(num_blocks))
self.block_tables = {} # request_id → list of block indices
self.ref_counts = {} # block_idx → reference count
def allocate(self, request_id: str, num_tokens: int):
"""Allocate blocks for a new request."""
num_blocks_needed = (num_tokens + self.block_size - 1) // self.block_size
if len(self.free_blocks) < num_blocks_needed:
raise MemoryError(f"Need {num_blocks_needed} blocks, "
f"only {len(self.free_blocks)} free")
blocks = [self.free_blocks.pop(0) for _ in range(num_blocks_needed)]
self.block_tables[request_id] = blocks
for b in blocks:
self.ref_counts[b] = 1
return blocks
def append_token(self, request_id: str):
"""Append a token — may need a new block."""
blocks = self.block_tables[request_id]
current_tokens = len(blocks) * self.block_size # simplified
# In reality, track partially-filled last block
if not self.free_blocks:
raise MemoryError("No free blocks for token append")
new_block = self.free_blocks.pop(0)
blocks.append(new_block)
self.ref_counts[new_block] = 1
def free(self, request_id: str):
"""Free all blocks for a completed request."""
for block in self.block_tables.pop(request_id):
self.ref_counts[block] -= 1
if self.ref_counts[block] == 0:
self.free_blocks.append(block)
del self.ref_counts[block]
def share_prefix(self, src_id: str, dst_id: str, n_prefix_blocks: int):
"""CoW: share prefix blocks between two requests."""
src_blocks = self.block_tables[src_id][:n_prefix_blocks]
self.block_tables[dst_id] = src_blocks.copy()
for b in src_blocks:
self.ref_counts[b] += 1
def utilization(self):
total = len(self.free_blocks) + sum(1 for _ in self.ref_counts)
used = sum(1 for _ in self.ref_counts)
return used / total if total > 0 else 0
# Exercise: simulate 50 concurrent requests arriving and leaving
We've optimized how we store the cache and how we schedule requests. But there's still the fundamental bottleneck: autoregressive generation requires N sequential forward passes for N tokens. What if you could generate 5 tokens in the time of 1? Day 60: Speculative Decoding introduces draft-and-verify algorithms that break the sequential barrier by predicting multiple tokens in parallel and verifying them in a single forward pass.