Phase III — LLMs: Training & Alignment | Week 5 | 2.5 hours "Why update 7 billion parameters when 4 million will do?" — Edward Hu
Full fine-tuning updates every parameter in the model:
Model Size Parameters FP16 Memory Optimizer Memory Total VRAM
────────── ────────── ─────────── ──────────────── ──────────
7B 7 × 10⁹ 14 GB 42 GB (AdamW) ~56 GB
13B 13 × 10⁹ 26 GB 78 GB ~104 GB
70B 70 × 10⁹ 140 GB 420 GB ~560 GB
Problem: Most researchers have 1-2 GPUs (24-80 GB)
→ Full fine-tuning of 7B+ models is impractical
Key insight: The weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix $W \in \mathbb{R}^{d \times k}$, decompose the update:
$$ W' = W_0 + \Delta W = W_0 + BA $$
where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
Full fine-tuning: LoRA:
┌────────────────────┐ ┌────────────────────┐
│ W (d × k) │ │ W₀ (d × k) │ ← frozen!
│ trainable │ │ frozen │
│ d × k params │ │ │
└────────────────────┘ │ + B(d×r) · A(r×k) │ ← trainable
│ (r << d, k) │
└────────────────────┘
d=4096, k=4096: d=4096, k=4096, r=16:
Full: 16.8M params LoRA: 2 × 4096 × 16 = 131K params
→ 128× fewer parameters!
Forward pass with LoRA:
$$ h = W_0 x + \frac{\alpha}{r} B A x $$
The $\frac{\alpha}{r}$ scaling factor controls the magnitude of the LoRA update. Typically $\alpha = 2r$.
Transformer Block:
┌─────────────────────────────────────────┐
│ Multi-Head Attention │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌─────┐ │
│ │ W_q │ │ W_k │ │ W_v │ │ W_o │ │
│ │ ✅LoRA│ │ ✅LoRA│ │ ✅LoRA│ │✅ │ │
│ └───────┘ └───────┘ └───────┘ └─────┘ │
│ │
│ Feed-Forward Network │
│ ┌──────────┐ ┌──────────┐ │
│ │ W_gate │ │ W_up │ │
│ │ ⚠️maybe │ │ ⚠️maybe │ │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ │
│ │ W_down │ │
│ │ ⚠️maybe │ │
│ └──────────┘ │
└─────────────────────────────────────────┘
Common strategy: Apply LoRA to Q, K, V, O projections
Aggressive strategy: Also apply to FFN weights
QLoRA combines two ideas: 1. Quantize the base model to 4-bit (NF4 format) — saves memory 2. Add LoRA adapters in full precision — maintains quality
$$ h = \underbrace{W_0^{\text{NF4}}}_{\text{4-bit frozen}} x + \underbrace{\frac{\alpha}{r} B A}_{\text{16-bit trainable}} x $$
Memory comparison for 7B model:
Base Model LoRA Params Optimizer Total
Full FT (FP16): 14 GB — 42 GB ~56 GB
LoRA (FP16): 14 GB ~8 MB ~24 MB ~14.1 GB
QLoRA (NF4): 3.5 GB ~8 MB ~24 MB ~3.6 GB
QLoRA: Fine-tune a 7B model on a single 4GB GPU!
NF4 (NormalFloat 4-bit): Quantization format optimized for normally-distributed weights (which neural network weights are). Better than uniform INT4.
Method Trainable Params Where How
────── ──────────────── ───── ───
Full fine-tuning 100% All weights Standard backprop
LoRA 0.1-1% Attention Low-rank matrices
QLoRA 0.1-1% Attention LoRA + 4-bit base
Adapters 1-5% After layers Small bottleneck MLPs
Prefix Tuning <0.1% Input prefix Learnable prefix tokens
Prompt Tuning <0.01% Input prefix Soft prompt embeddings
IA3 <0.01% Activations Learned scaling vectors
Prefix Tuning prepends learnable "virtual tokens" to the key and value in each attention layer:
$$ \text{Attention}(Q, [P_K; K], [P_V; V]) $$
where $P_K, P_V \in \mathbb{R}^{l \times d}$ are the learnable prefix matrices with $l$ virtual tokens.
Prompt Tuning is even simpler — only prepend to the input embeddings:
$$ \tilde{X} = [P; X] \quad \text{where } P \in \mathbb{R}^{l \times d} $$
"""
Day 35 Implementation: Compare full, LoRA, and QLoRA fine-tuning.
Uses TinyLlama for feasibility on consumer hardware.
"""
import torch
import time
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
BitsAndBytesConfig,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
from datasets import Dataset
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# --- Shared dataset ---
def get_dataset() -> Dataset:
examples = [
{"text": f"<|im_start|>user\nTask {i}<|im_end|>\n"
f"<|im_start|>assistant\nResponse {i}<|im_end|>"}
for i in range(100)
]
return Dataset.from_list(examples)
# ============================================================
# Method 1: Full Fine-Tuning
# ============================================================
def setup_full_finetune():
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype=torch.float16, device_map="auto",
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
mem = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0
return {
"method": "Full Fine-Tune",
"trainable_params": trainable,
"total_params": total,
"pct_trainable": 100.0,
"gpu_memory_gb": mem,
}
# ============================================================
# Method 2: LoRA (FP16 base)
# ============================================================
def setup_lora():
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype=torch.float16, device_map="auto",
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
bias="none",
)
model = get_peft_model(model, lora_config)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
pct = 100 * trainable / total
mem = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0
return {
"method": "LoRA (r=16, FP16)",
"trainable_params": trainable,
"total_params": total,
"pct_trainable": pct,
"gpu_memory_gb": mem,
}
# ============================================================
# Method 3: QLoRA (4-bit base + LoRA)
# ============================================================
def setup_qlora():
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, quantization_config=bnb_config, device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
bias="none",
)
model = get_peft_model(model, lora_config)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
pct = 100 * trainable / total
mem = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0
return {
"method": "QLoRA (r=16, NF4)",
"trainable_params": trainable,
"total_params": total,
"pct_trainable": pct,
"gpu_memory_gb": mem,
}
# ============================================================
# LoRA Math Verification
# ============================================================
def verify_lora_math():
"""Verify LoRA decomposition W' = W0 + BA."""
d, k, r = 512, 512, 8
W0 = torch.randn(d, k) # frozen base weight
B = torch.randn(d, r) * 0.01 # LoRA down-projection
A = torch.randn(r, k) * 0.01 # LoRA up-projection
alpha = 16
scaling = alpha / r
x = torch.randn(1, k)
# Standard computation
h_full = x @ (W0 + scaling * B @ A).T
# Decomposed (what happens at inference)
h_base = x @ W0.T
h_lora = scaling * (x @ A.T @ B.T)
h_decomposed = h_base + h_lora
diff = (h_full - h_decomposed).abs().max().item()
print(f"LoRA decomposition error: {diff:.2e}")
assert diff < 1e-5, "LoRA math verification failed!"
# Parameter savings
full_params = d * k
lora_params = d * r + r * k
savings = 100 * (1 - lora_params / full_params)
print(f"Full params: {full_params:,}")
print(f"LoRA params: {lora_params:,} ({savings:.1f}% reduction)")
if __name__ == "__main__":
print("=" * 60)
print("LoRA Math Verification")
print("=" * 60)
verify_lora_math()
print("\n" + "=" * 60)
print("Method Comparison (TinyLlama 1.1B)")
print("=" * 60)
# Run comparison (only if GPU available)
if torch.cuda.is_available():
for setup_fn in [setup_full_finetune, setup_lora, setup_qlora]:
torch.cuda.empty_cache()
info = setup_fn()
print(f"\n{info['method']}:")
print(f" Trainable: {info['trainable_params']:>12,} "
f"({info['pct_trainable']:.2f}%)")
print(f" Total: {info['total_params']:>12,}")
print(f" GPU Mem: {info['gpu_memory_gb']:.2f} GB")
else:
print("No GPU — run LoRA math verification only.")
Investigate the effect of LoRA rank $r$ on performance: 1. Train LoRA with $r \in \{2, 4, 8, 16, 32, 64\}$ on the same task 2. For each, record: trainable params, training loss, inference quality 3. Plot the Pareto frontier: quality vs. parameter count 4. What rank gives the best quality/cost trade-off?
Compare applying LoRA to different weight matrices: 1. Q+V only (original paper recommendation) 2. Q+K+V+O (all attention) 3. All attention + FFN 4. Which configuration gives the best quality per trainable parameter?
LoRA's low-rank assumption directly connects to the manifold hypothesis from Phase I: neural network weight updates lie on a low-dimensional manifold within the high-dimensional parameter space. For robotics VLAs, LoRA enables task-specific adaptation — a single base model with different LoRA adapters for different warehouse layouts, robot morphologies, or task types. This is the practical bridge between "one foundation model" and "many specialized deployments."