Phase IV · Week 9 · Day 61 of 70 · 2.5 hours
"The difference between a 140 GB model that needs a cluster and a 4 GB model that runs on a laptop is quantization. Same knowledge, different precision."
A 70B-parameter LLM in FP16 weighs ~140 GB: two A100-80GB GPUs for the weights alone, before any KV cache. Quantizing to INT4 shrinks it to ~35 GB, which fits on a single 80 GB GPU with room left for the KV cache. But LLMs aren't ordinary neural networks: they exhibit activation outliers, channels whose values are roughly 100× larger than average, that make naive quantization catastrophic. Specialized methods like GPTQ, AWQ, and the GGUF quantization types address this problem, each with different tradeoffs between accuracy, speed, and flexibility. Understanding them is essential because quantization is one of the highest-impact deployment optimizations: it reduces memory and cost, and because LLM decoding is usually memory-bandwidth bound, it often raises throughput as well.
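The memory figures above follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below is a back-of-the-envelope check; `weight_memory_gb` is a hypothetical helper name, it uses 1 GB = 10^9 bytes, and it ignores the small overhead of scales and zero points.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring scale/zero-point overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# A 70B-parameter model at three precisions
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real quantized checkpoints come out slightly larger because every group of weights also stores a scale (and often a zero point).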
Standard per-tensor INT8 quantization works well for CNNs and small transformers. It fails for LLMs because of outlier channels discovered by Dettmers et al. (LLM.int8(), 2022):
Activation Outlier Problem in LLMs
═══════════════════════════════════════════════════════════════
Normal activations (99.9% of channels):
┌─────────────────────────────────────────┐
│ Values: [-0.5, 0.3, -0.1, 0.4, -0.2] │ Range: ±1
└─────────────────────────────────────────┘
Outlier channels (0.1% of channels, but CRITICAL):
┌─────────────────────────────────────────┐
│ Values: [-60.0, 80.0, -45.0, 120.0] │ Range: ±120
└─────────────────────────────────────────┘
Naive per-tensor INT8 quantization:
┌─────────────────────────────────────────────────────────┐
│ Scale = 120 / 127 ≈ 0.945 │
│ │
│ Normal: 0.3 / 0.945 = 0.32 → round to 0 │
│ Normal: -0.5 / 0.945 = -0.53 → round to -1 │
│ Outlier: 120.0 / 0.945 = 127 → maps correctly │
│ │
│ Problem: Normal values get quantized to {-1, 0, 1}! │
│ → 99.9% of activations lose almost all information │
└─────────────────────────────────────────────────────────┘
Solution approaches:
┌────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ LLM.int8() │ │ GPTQ │ │ AWQ │
│ Mixed FP16/ │ │ Hessian- │ │ Protect salient│
│ INT8 by │ │ guided │ │ weights by │
│ channel │ │ rounding │ │ activation │
└────────────────┘ └──────────────┘ └─────────────────┘
The key insight: these outliers carry disproportionate information. Removing just 6 outlier dimensions from a 5120-dim hidden state (about 0.12% of the channels) can raise perplexity by 20 points or more.
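The collapse of the normal channels can be reproduced in a few lines. This is a minimal pure-Python sketch using the example values from the diagram; `quantize_absmax` is a hypothetical helper written for this illustration, not a library function.

```python
def quantize_absmax(values, bits=8):
    """Symmetric per-tensor quantization: one scale shared by every value."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


normal = [-0.5, 0.3, -0.1, 0.4, -0.2]
outliers = [-60.0, 80.0, -45.0, 120.0]

# One scale for the whole tensor: the outliers dictate the step size
codes_mixed, scale_mixed = quantize_absmax(normal + outliers)
print("scale with outliers:", round(scale_mixed, 3))
print("normal channels    :", codes_mixed[:len(normal)])

# Same normal values quantized without the outliers in the tensor
codes_clean, scale_clean = quantize_absmax(normal)
print("scale w/o outliers :", round(scale_clean, 5))
print("normal channels    :", codes_clean)
```

With the outliers present, the five normal values share only the codes -1 and 0; without them, each value gets its own code. This is why LLM.int8() keeps outlier channels in FP16 instead of sharing a scale with them.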
GPTQ (Frantar et al., 2023) quantizes weights layer by layer, using second-order information (the Hessian) to minimize the quantization error:
$$\min_{\hat{W}} \|WX - \hat{W}X\|_2^2$$
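where $X$ holds the layer inputs collected from a small calibration set, $W$ is the original weight matrix, and $\hat{W}$ is the quantized weight matrix.
GPTQ builds on Optimal Brain Quantization (OBQ). For each weight column, it quantizes the column and then adjusts the remaining full-precision weights $F$ to compensate for the rounding error:
$$\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$
Here $w_q$ is the weight being quantized, $\text{quant}(\cdot)$ rounds to the nearest grid point, and $H_F$ is the Hessian of the objective above restricted to $F$.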
import torch


def gptq_quantize_layer(
    weight: torch.Tensor,   # [out_features, in_features]
    hessian: torch.Tensor,  # [in_features, in_features] = 2 * X @ X^T
    bits: int = 4,
    group_size: int = 128,
    block_size: int = 128,  # columns processed together
):
    """Simplified GPTQ quantization for one linear layer.

    Args:
        weight: Original FP16 weight matrix
        hessian: H = 2 * X @ X^T from calibration data
        bits: Target bit width (typically 4)
        group_size: Number of columns sharing a scale factor
        block_size: Columns processed together before one batched update
    """
    W = weight.clone().float()
    n_rows, n_cols = W.shape

    # Quantization grid for N bits
    qmin, qmax = 0, 2**bits - 1  # asymmetric: [0, 15] for 4-bit

    # Damp and invert the Hessian, then take the upper Cholesky factor of the
    # inverse. Row i of that factor already accounts for columns 0..i-1 being
    # quantized, so no per-column update of H_inv is needed inside the loop.
    H = hessian.float()
    damp = 0.01 * torch.diag(H).mean()
    H = H + damp * torch.eye(n_cols, device=W.device)
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    H_inv = torch.linalg.cholesky(H_inv, upper=True)

    quantized = torch.zeros_like(W)
    n_groups = (n_cols + group_size - 1) // group_size  # round up for ragged last group
    scales = torch.zeros(n_rows, n_groups, device=W.device)
    zeros = torch.zeros_like(scales)

    # Process columns in blocks
    for block_start in range(0, n_cols, block_size):
        block_end = min(block_start + block_size, n_cols)
        W_block = W[:, block_start:block_end].clone()
        H_block = H_inv[block_start:block_end, block_start:block_end]
        error = torch.zeros_like(W_block)

        for col in range(block_end - block_start):
            global_col = block_start + col

            # Compute per-group scale and zero point at each group boundary
            group_idx = global_col // group_size
            if global_col % group_size == 0:
                group_end = min(global_col + group_size, n_cols)
                w_group = W[:, global_col:group_end]
                w_min = w_group.min(dim=1).values
                w_max = w_group.max(dim=1).values
                s = ((w_max - w_min) / qmax).clamp(min=1e-8)  # avoid divide-by-zero
                scales[:, group_idx] = s
                zeros[:, group_idx] = torch.round(-w_min / s)
            s = scales[:, group_idx]
            z = zeros[:, group_idx]

            # Quantize this column
            w_col = W_block[:, col]
            q_col = torch.clamp(torch.round(w_col / s + z), qmin, qmax)
            quantized[:, global_col] = q_col

            # Compute rounding error, normalized by the diagonal entry
            w_hat = (q_col - z) * s
            err = (w_col - w_hat) / H_block[col, col]
            error[:, col] = err  # saved for the cross-block update below

            # Distribute error to remaining columns in this block
            W_block[:, col + 1:] -= err.unsqueeze(1) * H_block[col, col + 1:].unsqueeze(0)

        # Distribute block error to remaining columns
        W[:, block_end:] -= error @ H_inv[block_start:block_end, block_end:]

    return quantized, scales, zeros
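To see why error compensation helps, it is useful to run the OBQ update by hand on the smallest possible case. The sketch below is a toy under stated assumptions: a single weight row with two inputs, a hand-picked Hessian `H` for strongly correlated inputs, and an integer grid so that quantizing is just `round`. None of these values come from a real model.

```python
def layer_error(w, w_hat, H):
    """Quadratic output error e^T H e for one weight row, with e = w - w_hat."""
    e = [a - b for a, b in zip(w, w_hat)]
    return sum(e[i] * H[i][j] * e[j] for i in range(2) for j in range(2))


# Hypothetical Hessian: the two inputs are strongly correlated (0.9)
H = [[1.0, 0.9], [0.9, 1.0]]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
H_inv = [[H[1][1] / det, -H[0][1] / det],
         [-H[1][0] / det, H[0][0] / det]]

w = [0.4, 0.4]  # quantization grid = integers, so quant(x) = round(x)

# Round-to-nearest: every weight is rounded independently
w_rtn = [round(w[0]), round(w[1])]

# OBQ/GPTQ: quantize w[0], then shift w[1] by delta_F before rounding it
q0 = round(w[0])
delta = -(w[0] - q0) / H_inv[0][0] * H_inv[1][0]
w_gptq = [q0, round(w[1] + delta)]

print("RTN :", w_rtn, "error =", round(layer_error(w, w_rtn, H), 3))
print("GPTQ:", w_gptq, "error =", round(layer_error(w, w_gptq, H), 3))
```

Rounding `w[0]` down leaves the output too small, and because the inputs are correlated, pushing `w[1]` up recovers most of that loss. Round-to-nearest cannot make this trade because it treats each weight in isolation.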
| Model | Bits | Perplexity (Wiki2) | Size | Speed (A100) |
|---|---|---|---|---|
| LLaMA-2 7B FP16 | 16 | 5.47 | 13.5 GB | 1.0× |
| LLaMA-2 7B GPTQ | 4 | 5.63 (+0.16) | 3.9 GB | 2.8× |
| LLaMA-2 7B GPTQ | 3 | 6.29 (+0.82) | 3.0 GB | 3.2× |
| LLaMA-2 70B FP16 | 16 | 3.32 | 137 GB | 1.0× |
| LLaMA-2 70B GPTQ | 4 | 3.48 (+0.16) | 37 GB | 3.5× |
AWQ (Lin et al., 2024) observes that not all weights are equally important. Weights connected to large-magnitude activation channels carry more information and should be quantized more carefully:
$$\text{Quantization error} \propto |w \cdot s_x|$$
where $s_x$ is the average activation magnitude for that channel.
AWQ Core Insight
═══════════════════════════════════════════════════════════════
Weight matrix W: [out_features × in_features]
Activation magnitudes per channel: s = mean(|X|, dim=batch)
Channel importance = |weight| × |activation|
┌──────────────────────────────────────────────────────────┐
│ Channel 42: w = 0.02, s_x = 0.3 → impact: 0.006 │ Low
│ Channel 107: w = 0.05, s_x = 80.0 → impact: 4.0 │ HIGH!
│ Channel 512: w = 3.00, s_x = 0.1 → impact: 0.3 │ Medium
└──────────────────────────────────────────────────────────┘
AWQ strategy: Scale up salient channels BEFORE quantizing
─────────────────────────────────────────────────────────
For salient channel i:
w_i' = w_i × α (scale weight UP → more quantization levels)
x_i' = x_i / α (scale activation DOWN → preserves W·X)
Net effect: W' · X' = W · X (mathematically equivalent)
But salient weights now use more of the INT4 range!
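The scale factors $\alpha$ (one per input channel) are found by a grid search that minimizes the output error on calibration data. In practice the search runs over a single exponent $r \in [0, 1]$ with $\alpha = s_x^{\,r}$, so $r = 0$ corresponds to no scaling: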
$$\alpha^* = \arg\min_\alpha \|Q(W \cdot \text{diag}(\alpha)) \cdot \text{diag}(\alpha)^{-1} \cdot X - W \cdot X\|$$
import torch


def pseudo_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Simulate quantization (quantize then dequantize)."""
    qmax = 2**bits - 1

    # Per-group asymmetric quantization; w.numel() must be divisible by group_size
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / qmax).clamp(min=1e-8)  # guard against constant groups
    w_q = torch.round((w - w_min) / scale).clamp(0, qmax)
    w_deq = w_q * scale + w_min
    return w_deq.reshape(orig_shape)


def awq_compute_scales(
    weight: torch.Tensor,       # [out, in]
    activations: torch.Tensor,  # [n_samples, in]
    bits: int = 4,
    grid_size: int = 20,
):
    """Find optimal per-channel scaling factors for AWQ.

    Key insight: scale up important channels before quantizing,
    scale down activations to compensate → same output, better precision.
    """
    in_features = weight.shape[1]

    # Channel importance = average activation magnitude
    act_scales = activations.abs().mean(dim=0)  # [in_features]

    # Original output for reference
    original_out = weight @ activations.T  # [out, n_samples]

    best_scales = torch.ones(in_features, device=weight.device)
    best_error = float("inf")

    # Grid search over scale exponents
    for ratio in torch.linspace(0, 1, grid_size):
        # Scale = act_scales^ratio: ratio=0 leaves weights untouched, larger
        # ratios boost channels with large activations the most
        scales = act_scales.pow(ratio).clamp(min=1e-4)

        # Scale weights up, quantize, scale back
        w_scaled = weight * scales.unsqueeze(0)
        w_quant = pseudo_quantize(w_scaled, bits=bits)
        w_dequant = w_quant / scales.unsqueeze(0)

        # Measure error on calibration data
        quant_out = w_dequant @ activations.T
        error = (original_out - quant_out).pow(2).mean().item()

        if error < best_error:
            best_error = error
            best_scales = scales.clone()

    return best_scales
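The effect of scaling one salient channel can be checked by hand on a single quantization group. The sketch below is a simplified illustration, not the AWQ implementation: it uses symmetric round-to-nearest with one absmax scale (the code above uses an asymmetric grid), and the weights, activations, and `alpha` values are made up for the example.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with one shared absmax scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    step = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def output_error(weights, dequantized, activations):
    """Absolute difference between the exact and quantized dot products."""
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(w * x for w, x in zip(dequantized, activations))
    return abs(exact - approx)


w = [0.13, 0.70, -0.50, 0.33]     # channel 0 has a small weight...
x = [80.0, 0.3, 0.2, 0.1]         # ...but a very large activation
alpha = [4.0, 1.0, 1.0, 1.0]      # hypothetical scale: boost only channel 0

# Plain round-to-nearest
plain = quantize_group(w)
err_plain = output_error(w, plain, x)

# AWQ-style: scale up, quantize, fold 1/alpha back into the dequantized weight
scaled = quantize_group([wi * a for wi, a in zip(w, alpha)])
awq = [q / a for q, a in zip(scaled, alpha)]
err_awq = output_error(w, awq, x)

print(f"plain error: {err_plain:.3f}")
print(f"AWQ error  : {err_awq:.3f}")
```

The group's largest weight is unchanged by the scaling, so the step size stays the same, but the effective step for channel 0 shrinks by `alpha`. If `alpha` were large enough to raise the group maximum, every other channel would get a coarser grid, which is the trade-off the grid search balances.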
| Aspect | GPTQ | AWQ |
|---|---|---|
| Approach | Hessian-guided rounding + error compensation | Activation-aware channel scaling |
| Calibration | ~128 samples, ~5 min for 7B | ~128 samples, ~2 min for 7B |
| Accuracy (4-bit) | Very good | Slightly better on most benchmarks |
| Speed (inference) | Fast (optimized kernels) | Fast (AutoAWQ kernels) |
| Robustness | Can be sensitive to calibration data | More robust across domains |
| Key advantage | Error redistribution across weights | Protects salient channels |
GGUF (GPT-Generated Unified Format) is the native format for llama.cpp, designed for CPU and edge inference. It supports a rich menu of quantization types:
GGUF Quantization Types — The Precision Ladder
═══════════════════════════════════════════════════════════════
Type Bits Size/7B Perplexity Use Case
─────────────────────────────────────────────────────────
F16 16.0 13.5 GB 5.47 Reference baseline
Q8_0 8.5 7.2 GB 5.48 Near-lossless
Q6_K 6.6 5.5 GB 5.50 Best quality below 8-bit
Q5_K_M 5.7 4.8 GB 5.52 ★ Best accuracy/size ratio
Q5_K_S 5.5 4.6 GB 5.54 Slightly smaller
Q4_K_M 4.8 4.1 GB 5.63 ★ Most popular choice
Q4_K_S 4.6 3.9 GB 5.68 Good for smaller RAM
Q4_0 4.5 3.8 GB 5.72 Basic 4-bit (legacy)
Q3_K_M 3.9 3.3 GB 6.15 Noticeable degradation
Q3_K_S 3.5 3.0 GB 6.42 Aggressive compression
Q2_K 3.3 2.8 GB 8.50 Research only
IQ4_XS 4.3 3.7 GB 5.60 ★ imatrix, best <4GB
IQ3_XXS 3.1 2.6 GB 6.80 imatrix, extreme
IQ2_XXS 2.1 1.8 GB 12.00 Pushing limits
─────────────────────────────────────────────────────────
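K-quants: super-block formats (256 weights per super-block) whose per-block scales and minimums are themselves quantized; they are not k-means codebooks
S/M/L: Small/Medium/Large variants; larger variants keep more tensors at higher precision
IQ: i-quants, codebook-based formats that are usually produced with an importance matrix (imatrix)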
┌────────────────────────────────────────────────────────┐
│ Rule of thumb: │
│ • Plenty of RAM → Q5_K_M (best quality per GB) │
│ • Tight RAM → Q4_K_M (sweet spot) │
│ • Extreme edge → IQ4_XS with imatrix │
└────────────────────────────────────────────────────────┘
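The fractional "Bits" column comes from per-block metadata. As a sketch, assume the legacy block layout used by `Q8_0` and `Q4_0`: 32 weights per block plus one FP16 scale. The helper names below are hypothetical, and the 6.74e9 parameter count for LLaMA-2 7B is an assumption added for this check.

```python
def bits_per_weight(block_size: int, bits: int, scale_bits: int = 16) -> float:
    """Effective bits per weight when each block stores one scale of `scale_bits` bits."""
    return (block_size * bits + scale_bits) / block_size


def model_size_gb(n_params: float, bpw: float) -> float:
    """Approximate file size in GB (1 GB = 1e9 bytes) at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9


q8_0 = bits_per_weight(block_size=32, bits=8)
q4_0 = bits_per_weight(block_size=32, bits=4)

n_params = 6.74e9  # assumed parameter count for LLaMA-2 7B
print(f"Q8_0: {q8_0} bpw, about {model_size_gb(n_params, q8_0):.1f} GB")
print(f"Q4_0: {q4_0} bpw, about {model_size_gb(n_params, q4_0):.1f} GB")
```

Both results line up with the Q8_0 and Q4_0 rows of the table. K-quants follow the same idea with 256-weight super-blocks and more compact scale storage.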
K-quant presets don't use a uniform bit width. They keep the more sensitive tensors at higher precision:
K-Quant Mixed Precision Strategy
═══════════════════════════════════════════════════════════════
Q4_K_M allocation for a 32-layer transformer:
Layer Type Bits Why
──────────────────────────────────────────────────
Embedding 6 First/last layers are sensitive
Attention Q,K 4 Can tolerate some noise
Attention V 5 Values carry content information
Attention Output 5 Aggregation layer, moderate
FFN Gate/Up 4 Largest matrices, most savings
FFN Down 5 Output projection, moderate
Final LayerNorm 32 Tiny, keep full precision
LM Head 6 Classification layer, sensitive
──────────────────────────────────────────────────
Result: "4-bit" model actually uses 4.8 bits average
→ Q4_K_M vs Q4_0: same nominal 4-bit label, about 0.1 lower perplexity
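A parameter-weighted average shows how a per-tensor allocation turns into a single "bits per weight" figure. The sketch below takes the nominal bit widths from the table and assumes LLaMA-2 7B tensor shapes (hidden size 4096, FFN size 11008, vocabulary 32000, 32 layers); it ignores the tiny norm tensors and all per-block scale metadata, so it gives a lower bound rather than the on-disk figure.

```python
HIDDEN, FFN, VOCAB, LAYERS = 4096, 11008, 32000, 32  # assumed LLaMA-2 7B shapes

# name: (parameter count, nominal bits taken from the table)
per_layer = {
    "attn_q": (HIDDEN * HIDDEN, 4),
    "attn_k": (HIDDEN * HIDDEN, 4),
    "attn_v": (HIDDEN * HIDDEN, 5),
    "attn_output": (HIDDEN * HIDDEN, 5),
    "ffn_gate": (HIDDEN * FFN, 4),
    "ffn_up": (HIDDEN * FFN, 4),
    "ffn_down": (FFN * HIDDEN, 5),
}
once = {
    "embedding": (VOCAB * HIDDEN, 6),
    "lm_head": (VOCAB * HIDDEN, 6),
}

total_params = 0
total_bits = 0
for count, bits in per_layer.values():
    total_params += count * LAYERS
    total_bits += count * bits * LAYERS
for count, bits in once.values():
    total_params += count
    total_bits += count * bits

avg_bits = total_bits / total_params
print(f"parameters  : {total_params / 1e9:.2f} B")
print(f"nominal avg : {avg_bits:.2f} bits per weight")
```

The nominal average lands around 4.45 bits. The gap to the quoted 4.8 is plausibly the per-block scale metadata that this sketch leaves out.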
# Step 1: Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
/models/llama-2-7b-hf \
--outfile llama-2-7b-f16.gguf \
--outtype f16
# Step 2: Quantize (basic)
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q4_K_M.gguf Q4_K_M
# Step 3: Quantize with importance matrix (better quality!)
# First, generate importance matrix from calibration data
./llama-imatrix \
-m llama-2-7b-f16.gguf \
-f wiki.train.raw \
-o imatrix.dat \
--chunks 200
# Then quantize using importance data
./llama-quantize \
--imatrix imatrix.dat \
llama-2-7b-f16.gguf \
llama-2-7b-IQ4_XS.gguf \
IQ4_XS
# Step 4: Test perplexity
./llama-perplexity \
-m llama-2-7b-Q4_K_M.gguf \
-f wiki.test.raw
SqueezeLLM (Kim et al., 2024) observes that weight distributions are non-uniform and uses k-means clustering to find optimal quantization centroids instead of uniform grid points:
$$\min_{\{c_1, \ldots, c_{2^b}\}} \sum_i s_i \cdot (w_i - c_{k_i})^2$$
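where $s_i$ is the sensitivity of weight $w_i$ (a diagonal Hessian approximation, which the paper computes from the Fisher information), $c_{k_i}$ is the centroid nearest to $w_i$, and $b$ is the bit width.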
Uniform vs Non-Uniform Quantization
═══════════════════════════════════════════════════════════════
Weight distribution (typical LLM layer):
                  ▄▄
                 ▄██▄
                ▄████▄
               ▄██████▄
              ▄████████▄
          ▄▄▄▄██████████▄▄▄▄
─────────────────────────────────────── value
        -0.1       0        +0.1
Uniform 4-bit (16 levels):
──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──
Evenly spaced → wastes levels on sparse tails
Non-uniform (k-means, 16 centroids):
─────────┼┼┼┼┼┼┼┼┼┼┼┼──────┼──────┼──
More levels where data is dense
→ 0.1-0.3 perplexity improvement
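The sensitivity-weighted objective can be minimized with ordinary Lloyd iterations in one dimension. The sketch below is a toy version, not the SqueezeLLM code: the weights are synthetic Gaussian samples, the sensitivities are random placeholders, and the centroids are initialized from the uniform 16-level grid so the two schemes can be compared directly.

```python
import random


def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda k: (value - centroids[k]) ** 2)


def weighted_error(weights, sens, centroids):
    """Sum of s_i * (w_i - c_{k_i})^2, the objective from the formula above."""
    return sum(s * (w - centroids[nearest(w, centroids)]) ** 2
               for w, s in zip(weights, sens))


def weighted_kmeans(weights, sens, centroids, n_iters=20):
    """Lloyd iterations where each centroid is the sensitivity-weighted mean of its cluster."""
    centroids = list(centroids)
    for _ in range(n_iters):
        num = [0.0] * len(centroids)
        den = [0.0] * len(centroids)
        for w, s in zip(weights, sens):
            k = nearest(w, centroids)
            num[k] += s * w
            den[k] += s
        # Empty clusters keep their previous centroid
        centroids = [n / d if d > 0 else c for n, d, c in zip(num, den, centroids)]
    return centroids


random.seed(0)
weights = [random.gauss(0.0, 0.03) for _ in range(2000)]   # synthetic bell-shaped weights
sens = [random.uniform(0.1, 1.0) for _ in weights]         # placeholder sensitivities

lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * k / 15 for k in range(16)]     # uniform 4-bit grid
learned = weighted_kmeans(weights, sens, uniform)

print("uniform grid error:", weighted_error(weights, sens, uniform))
print("k-means error     :", weighted_error(weights, sens, learned))
```

Because each Lloyd step can only lower the objective, the learned centroids never do worse than the uniform grid they started from. The price is a lookup table at inference time instead of a simple scale-and-shift.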
QuIP# (Tseng et al., 2024) multiplies the weight matrix by random orthogonal transforms so that outliers are spread evenly across entries (incoherence processing), which makes 2-bit quantization usable:
$$W_{\text{quantized}} = Q_{\text{lattice}}(U^T W V)$$
where $U, V$ are randomized Hadamard (orthogonal) matrices and $Q_{\text{lattice}}$ rounds to a codebook built on the E8 lattice.
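A tiny example shows what "spreading outliers" means. The sketch below applies a fixed 4×4 normalized Hadamard matrix to a vector with one large entry. It is a simplification: QuIP# uses randomized Hadamard transforms on both sides of a full weight matrix, and the vector here is made up.

```python
import math

# Normalized 4x4 Hadamard matrix (orthogonal and symmetric)
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def matvec(M, v):
    """Matrix-vector product for small dense lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


w = [8.0, 0.0, 0.0, 0.0]          # one outlier, three zeros
rotated = matvec(H4, w)           # outlier energy is spread over all entries
restored = matvec(H4, rotated)    # H4 is its own inverse

print("original:", w, "max =", max(abs(x) for x in w))
print("rotated :", rotated, "max =", max(abs(x) for x in rotated))
print("norms   :", norm(w), norm(rotated))
print("restored:", restored)
```

The rotation halves the largest magnitude while preserving the norm, so a fixed grid wastes less of its range on a single entry. Since the transform is orthogonal, it can be undone exactly after dequantization.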
| Method | 2-bit PPL | 3-bit PPL | 4-bit PPL | Speed |
|---|---|---|---|---|
| GPTQ | 107.3 | 6.29 | 5.63 | Fast |
| AWQ | — | 6.15 | 5.60 | Fast |
| SqueezeLLM | — | 5.95 | 5.54 | Moderate |
| QuIP# | 8.33 | 5.72 | 5.49 | Slow |
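The perplexity columns in these tables are the exponential of the average per-token negative log-likelihood (NLL). When a long text is scored in several windows, each window's mean NLL should be weighted by how many tokens it scored. The sketch below illustrates that aggregation with made-up window values; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(window_nlls, window_token_counts):
    """exp of the token-weighted mean NLL across evaluation windows."""
    total_nll = sum(nll * n for nll, n in zip(window_nlls, window_token_counts))
    return math.exp(total_nll / sum(window_token_counts))


# Two windows whose per-token perplexities are 4 and 16 (illustrative values)
nlls = [math.log(4.0), math.log(16.0)]

equal = perplexity(nlls, [100, 100])    # same token count in both windows
skewed = perplexity(nlls, [300, 100])   # first window scores three times as many tokens

print(f"equal windows : {equal:.3f}")
print(f"skewed windows: {skewed:.3f}")
```

With equal windows the result is the geometric mean of 4 and 16. An unweighted mean over windows would report that same value for the skewed case too, overstating the influence of the short window.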
"""Quantization benchmark framework for LLMs."""
import time

import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


def measure_perplexity(model, tokenizer, dataset_name="wikitext", split="test", stride=512):
    """Measure perplexity on standard benchmark (WikiText-2, sliding window)."""
    dataset = load_dataset(dataset_name, "wikitext-2-raw-v1", split=split)
    encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")
    max_length = model.config.max_position_embeddings
    seq_len = encodings.input_ids.size(1)

    nll_sum, n_scored = 0.0, 0
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-target_len] = -100  # mask overlapping context

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)

        # outputs.loss is a mean over scored tokens; the internal label shift drops the first one
        n_tokens = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.item() * n_tokens
        n_scored += n_tokens

        prev_end = end
        if end == seq_len:
            break  # avoid re-scoring the tail with ever-shorter windows

    return float(np.exp(nll_sum / n_scored))


def measure_throughput(model, tokenizer, prompt="Hello", n_tokens=128, n_runs=5):
    """Measure tokens/second generation speed."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    use_cuda = torch.cuda.is_available()

    # Warmup
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=10, do_sample=False)

    times, generated = [], []
    for _ in range(n_runs):
        if use_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            output = model.generate(input_ids, max_new_tokens=n_tokens, do_sample=False)
        if use_cuda:
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
        # Generation can stop early at EOS, so count what was actually produced
        generated.append(output.shape[1] - input_ids.shape[1])

    avg_time = float(np.mean(times))
    tokens_per_sec = float(np.sum(generated) / np.sum(times))
    return tokens_per_sec, avg_time


def benchmark_report(results: dict):
    """Print comparison table."""
    print(f"{'Method':<15} {'PPL':>8} {'Tok/s':>8} {'Size GB':>8} {'VRAM GB':>8}")
    print("─" * 51)
    for name, r in results.items():
        print(f"{name:<15} {r['ppl']:>8.2f} {r['tps']:>8.1f} "
              f"{r['size_gb']:>8.1f} {r['vram_gb']:>8.1f}")
"""Quantize a model with GPTQ and measure quality."""
# pip install auto-gptq transformers
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure quantization
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=True, # Use activation order (slower but better)
damp_percent=0.01, # Hessian damping
)
# Load and quantize
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
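model.quantize(
    # Placeholder only: one sentence is far too little calibration data.
    # Use on the order of 128 real text samples, as in the comparison table.
    examples=[tokenizer("Example calibration text", return_tensors="pt")],
    batch_size=1,
)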
model.save_quantized("llama-7b-gptq-4bit")
"""Quantize with AWQ and compare to GPTQ."""
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM", # GEMM or GEMV kernel
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-7b-awq-4bit")
You now know how to make models smaller. But which framework should you actually deploy them with? Day 62: Serving Frameworks Comparison puts vLLM, TensorRT-LLM, TGI, SGLang, llama.cpp, and Triton side-by-side — architecture, performance, GPU strategies, and when each one wins.