Phase II · Week 3 · Day 18 of 70 · 2.5 hours
"The polyhedral model turns 'can I legally reorder these loops?' from a gut feeling into a theorem you can prove with linear algebra."
| ← Previous | Next → | 📅 Week | 🔷 Phase | 📚 Curriculum |
|---|---|---|---|---|
| Day 17: Graph-Level Optimizations | Day 19: Loop Tiling & Scheduling | Week 3: IRs & Passes | Phase II: Compiler Fundamentals | ML Compilers |
Graph-level optimizations decide which operators to fuse. But once you've decided to fuse a matmul + bias + relu into one kernel, you still need to decide how to execute the loops inside that kernel — which loop goes outermost, how to tile for cache, whether to parallelize across GPU threads. The polyhedral model is the mathematical framework that tells you which loop transformations are legal (preserve correctness) and which are profitable (improve performance). TVM, Halide, MLIR's Affine dialect, and Pluto all build on this foundation.
Consider matrix multiply — three nested loops with a specific data access pattern:
// Naive matmul: C[i][j] += A[i][k] * B[k][j]
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < K; k++)
C[i][j] += A[i][k] * B[k][j];
Questions a compiler must answer:
1. Can we reorder loops (e.g., i,k,j instead of i,j,k)?
2. Can we tile loops (e.g., process 32×32 blocks)?
3. Can we parallelize the outer loop?
4. Can we vectorize the inner loop?
The polyhedral model answers all of these by modeling the computation as a geometric object in an integer lattice.
The set of all loop iterations, represented as an integer polyhedron — a set of integer points bounded by affine inequalities.
For the matmul above:
$$\mathcal{D}_S = \{(i, j, k) \in \mathbb{Z}^3 \mid 0 \leq i < M, \; 0 \leq j < N, \; 0 \leq k < K\}$$
This is a 3D rectangular box in iteration space:
k ▲
│ ┌────────────┐
│ / /│
K ───┤ / / │
│ / / │
│┌────────────┐ │
││ │ │
││ Iteration │ /
││ Domain │ / N
││ │/
│└────────────┘
└──────────────────▶ j
/
/ i
▼
M
Each memory access is an affine function of the loop iterators:
Statement: C[i][j] += A[i][k] * B[k][j] |
Access | Affine Map |
|---|---|---|
| Read A | $A[i][k]$ | $(i, j, k) \mapsto (i, k)$ |
| Read B | $B[k][j]$ | $(i, j, k) \mapsto (k, j)$ |
| Read/Write C | $C[i][j]$ | $(i, j, k) \mapsto (i, j)$ |
In matrix form, the access to $A[i][k]$:
$$\begin{pmatrix} \text{row} \\ \text{col} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix}$$
A schedule maps each iteration point to a timestamp — when it should execute. The original schedule for i,j,k order:
$$\theta(i, j, k) = (i, j, k)$$
Meaning: iteration $(i, j, k)$ executes at time $(i, j, k)$ in lexicographic order. A different schedule, say loop order i,k,j:
$$\theta'(i, j, k) = (i, k, j)$$
A schedule is legal if it preserves all data dependencies.
Two iterations depend on each other if they access the same memory location and at least one is a write. For matmul:
C[i][j] += A[i][k] * B[k][j];
// ^^ write ^^ same statement, different k
Iteration $(i, j, k_1)$ and $(i, j, k_2)$ both write to $C[i][j]$, creating a dependence:
$$\delta = (i, j, k_2) - (i, j, k_1) = (0, 0, k_2 - k_1)$$
The dependence vector is $(0, 0, +)$ — positive in the $k$ direction. This means: - Loops $i$ and $j$ are independent — can be freely reordered, parallelized, or tiled - Loop $k$ carries a dependence — iterations must execute in $k$-order (or use atomic reduction)
Dependence Analysis for Matmul
───────────────────────────────
Statement S: C[i][j] += A[i][k] * B[k][j]
Dependencies (S → S):
Flow (RAW): (i,j,k) → (i,j,k+1) on C[i][j]
direction: (0, 0, +)
Conclusion:
i-loop: parallelizable ✓
j-loop: parallelizable ✓
k-loop: reduction (must be sequential or atomic)
Legal loop orders: any permutation that keeps k innermost
✓ i, j, k ✓ j, i, k
✗ k, i, j (would violate k-dependence if k is now parallel)
A schedule $\theta'$ is legal if and only if, for every dependence $(s_1, s_2)$ with dependence vector $\delta$:
$$\theta'(s_2) - \theta'(s_1) \succ \mathbf{0} \quad \text{(lexicographically positive)}$$
This ensures that the dependent iteration $s_2$ always executes after $s_1$.
Any loop transformation expressible as an affine function of the original iterators is a polyhedral transformation:
Swap loop $i$ and $j$:
$$T_{\text{interchange}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} i' \\ j' \\ k' \end{pmatrix} = T \begin{pmatrix} i \\ j \\ k \end{pmatrix} = \begin{pmatrix} j \\ i \\ k \end{pmatrix}$$
Split loop $i$ into blocks of size $B$:
$$i = B \cdot i_o + i_i, \quad 0 \leq i_i < B$$
The tiled loop nest io, jo, ii, ji, k:
for (int io = 0; io < M/B; io++)
for (int jo = 0; jo < N/B; jo++)
for (int ii = 0; ii < B; ii++)
for (int ji = 0; ji < B; ji++)
for (int k = 0; k < K; k++)
C[io*B+ii][jo*B+ji] += A[io*B+ii][k] * B[k][jo*B+ji];
This is an affine transformation from $(i, j, k) \to (i_o, j_o, i_i, j_i, k)$:
$$i_o = \lfloor i / B \rfloor, \quad i_i = i \mod B, \quad j_o = \lfloor j / B \rfloor, \quad j_i = j \mod B$$
Used to expose parallelism in wavefront patterns:
$$\begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$$
This transforms a diagonal dependence pattern into one where outer iterations are independent.
ISL is the standard library for polyhedral operations. TVM, MLIR, Pluto, and PPCG all use it.
# Python bindings (islpy)
import islpy as isl
# Define an iteration domain
domain = isl.Set("{ S[i,j,k] : 0 <= i < 128 and 0 <= j < 128 and 0 <= k < 64 }")
print(f"Domain: {domain}")
print(f"Dimension: {domain.dim(isl.dim_type.set)}")
# Define an access relation
read_A = isl.Map("{ S[i,j,k] -> A[i,k] }")
print(f"Read A: {read_A}")
# Apply domain to get accessed elements
accessed = read_A.intersect_domain(domain)
accessed_set = accessed.range()
print(f"A elements accessed: {accessed_set}")
# Define a schedule (original order i,j,k)
schedule = isl.Map("{ S[i,j,k] -> [i,j,k] }")
# Define a tiled schedule
tiled = isl.Map("{ S[i,j,k] -> [floor(i/32), floor(j/32), i - 32*floor(i/32), j - 32*floor(j/32), k] }")
print(f"Tiled schedule: {tiled}")
# Dependence: S(i,j,k) writes C[i,j], S(i,j,k') reads C[i,j] where k' > k
write_C = isl.Map("{ S[i,j,k] -> C[i,j] }")
read_C = isl.Map("{ S[i,j,k] -> C[i,j] }")
# RAW dependence: same C element, source before sink
# In ISL notation:
dep = isl.Map("{ S[i,j,k] -> S[i,j,k'] : k' > k and 0 <= k < 64 and 0 <= k' < 64 }")
print(f"Dependence: {dep}")
# Check if a transformation is legal:
# New schedule must map source before sink for all dependencies
TVM's schedule primitives are polyhedral transformations with user-friendly names:
| TVM Primitive | Polyhedral Transformation | Effect |
|---|---|---|
split(i, 32) |
Strip-mine axis $i$ by factor 32 | Creates i_outer, i_inner |
reorder(j, i) |
Loop interchange | Swaps execution order |
tile(i, j, 32, 32) |
2D strip-mine + interchange | Creates io, jo, ii, ji |
fuse(i, j) |
Collapse dimensions | Linearize two loops |
parallel(i) |
Mark as independent | Execute on threads |
vectorize(i) |
Mark innermost as SIMD | Use vector instructions |
compute_at(B, i) |
Change producer schedule | Fuse producer into consumer loop |
# TVM example (conceptual)
import tvm
from tvm import te
M, N, K = 1024, 1024, 1024
A = te.placeholder((M, K), name='A')
B = te.placeholder((K, N), name='B')
k = te.reduce_axis((0, K), name='k')
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name='C')
s = te.create_schedule(C.op)
# Polyhedral transformations via TVM API
i, j = s[C].op.axis
k, = s[C].op.reduce_axis
# Tile: strip-mine i and j
io, ii = s[C].split(i, factor=32)
jo, ji = s[C].split(j, factor=32)
# Reorder: interchange for locality
s[C].reorder(io, jo, ii, ji, k)
# Parallelize outer loops
s[C].parallel(io)
# Vectorize innermost
s[C].vectorize(ji)
print(tvm.lower(s, [A, B, C], simple_mode=True))
For each loop nest, determine the dependence vectors and which loops can be parallelized:
// (a) Element-wise: Y[i] = X[i] * 2
for (int i = 0; i < N; i++)
Y[i] = X[i] * 2;
// (b) Stencil: Y[i] = X[i-1] + X[i] + X[i+1]
for (int i = 1; i < N-1; i++)
Y[i] = X[i-1] + X[i] + X[i+1];
// (c) Scan (prefix sum): Y[i] = Y[i-1] + X[i]
for (int i = 1; i < N; i++)
Y[i] = Y[i-1] + X[i];
// (d) 2D convolution
for (int i = 0; i < H; i++)
for (int j = 0; j < W; j++)
for (int ki = 0; ki < 3; ki++)
for (int kj = 0; kj < 3; kj++)
O[i][j] += I[i+ki][j+kj] * K[ki][kj];
Expected answers: | Case | Deps | Parallelizable? | |:-----|:-----|:---------------| | (a) | None (X≠Y) | Fully parallel | | (b) | ? | ? | | (c) | ? | ? | | (d) | ? | ? |
import numpy as np
import time
def matmul_naive(A, B):
M, K = A.shape
_, N = B.shape
C = np.zeros((M, N))
for i in range(M):
for j in range(N):
for k in range(K):
C[i][j] += A[i][k] * B[k][j]
return C
def matmul_tiled(A, B, block=32):
M, K = A.shape
_, N = B.shape
C = np.zeros((M, N))
for io in range(0, M, block):
for jo in range(0, N, block):
for ko in range(0, K, block):
# Tile bounds
ie = min(io + block, M)
je = min(jo + block, N)
ke = min(ko + block, K)
for ii in range(io, ie):
for ji in range(jo, je):
for ki in range(ko, ke):
C[ii][ji] += A[ii][ki] * B[ki][ji]
return C
# Test correctness and measure performance
N = 256 # Keep small for pure Python
A = np.random.randn(N, N)
B = np.random.randn(N, N)
t0 = time.perf_counter()
C1 = matmul_naive(A, B)
t_naive = time.perf_counter() - t0
t0 = time.perf_counter()
C2 = matmul_tiled(A, B, block=32)
t_tiled = time.perf_counter() - t0
print(f"Naive: {t_naive:.3f}s")
print(f"Tiled: {t_tiled:.3f}s")
print(f"Speedup: {t_naive/t_tiled:.2f}x")
print(f"Correct: {np.allclose(C1, C2)}")
# pip install islpy
import islpy as isl
# 1. Create the iteration domain for 2D convolution
# i: 0..63, j: 0..63, ki: 0..2, kj: 0..2
domain = isl.Set("{ S[i,j,ki,kj] : 0<=i<64 and 0<=j<64 and 0<=ki<3 and 0<=kj<3 }")
# 2. Write the access relation for I[i+ki][j+kj]
access_I = isl.Map("{ S[i,j,ki,kj] -> I[i+ki, j+kj] }")
# 3. What is the footprint? (set of I elements accessed)
footprint = access_I.intersect_domain(domain).range()
print(f"I footprint: {footprint}")
# Should be { I[r,c] : 0 <= r <= 66 and 0 <= c <= 66 }
# 4. Create a tiled schedule
tiled = isl.Map("{ S[i,j,ki,kj] -> [floor(i/8), floor(j/8), "
"i - 8*floor(i/8), j - 8*floor(j/8), ki, kj] }")
# 5. What is the data reuse within one tile?
# Compute the I footprint for a single tile (io=0, jo=0)
tile_domain = isl.Set("{ S[i,j,ki,kj] : 0<=i<8 and 0<=j<8 and 0<=ki<3 and 0<=kj<3 }")
tile_footprint = access_I.intersect_domain(tile_domain).range()
print(f"Tile footprint: {tile_footprint}")
# Should be { I[r,c] : 0 <= r <= 10 and 0 <= c <= 10 } = 11x11 = 121 elements
# But total accesses = 8*8*3*3 = 576 → reuse ratio = 576/121 ≈ 4.8x
split, reorder, tile, fuse) are exactly polyhedral transformations with user-friendly APIsDay 19 puts the polyhedral model to work with loop tiling and scheduling — we'll tile matrix multiply for L1/L2 cache, implement multi-level tiling strategies, explore TVM's AutoScheduler (Ansor), and see how autotuning searches the space of legal schedules to find hardware-optimal configurations automatically.