Single vs Multi-Head Attention

Attention series: 9 of 11

Apr 27, 2026

The most important fact about multi-head attention is that, at the same total width, it has the same parameter count as single-head attention. The difference is purely structural: the same total Wqkv weights, partitioned into H smaller q–k–v triples.

Look at the two diagrams below. Both Wqkv matrices have the same height: the same number of weight rows, the same number of parameters. What changes is how that single tall block is sliced.

Left. One head. The full Wqkv produces one big QKV: a tall Q (36 rows), a tall K, a tall V. One scoring computation runs over those full-width tensors.

Right. 3 heads. The same-height Wqkv is sliced into 3 smaller q–k–v triples, each 12 rows tall. 3 scoring computations run in parallel, each a thinner version of the left.
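
The slicing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input width `D_IN = 16` is made up (the post does not give one), and the layout (Q rows first, then K, then V, with each head taking a contiguous 12-row slice of each) is one plausible arrangement, not necessarily the one in the diagram. The point is that every row of Wqkv is used exactly once either way.

```python
# Sizes from the post: H = 3 heads, head_dim = 12, so Q, K, V are 36 rows each.
H, HEAD_DIM = 3, 12
D = H * HEAD_DIM   # 36
D_IN = 16          # assumed input width (hypothetical, not from the post)


def single_head_rows():
    """Row ranges of W_qkv that produce the one big Q, K and V."""
    return {"Q": range(0, D), "K": range(D, 2 * D), "V": range(2 * D, 3 * D)}


def multi_head_rows():
    """The same rows, regrouped into H (q, k, v) triples of HEAD_DIM rows each."""
    big = single_head_rows()
    triples = []
    for h in range(H):
        lo, hi = h * HEAD_DIM, (h + 1) * HEAD_DIM
        triples.append({name.lower(): big[name][lo:hi] for name in "QKV"})
    return triples


def param_count(row_ranges):
    """Each row of W_qkv holds D_IN weights."""
    return sum(len(r) for r in row_ranges) * D_IN


single = single_head_rows()
multi = multi_head_rows()

single_params = param_count(single.values())
multi_params = sum(param_count(t.values()) for t in multi)

print(single_params, multi_params)  # 1728 1728
```

Changing `D_IN` changes both totals by the same factor, so the equality does not depend on the assumed input width.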

The compute trade-off (kind of). The Wqkv weights are the same, but multi-head runs the attention scoring S = Kᵀ × Q once per head, so the number of dot products is multiplied by H.

Single-head: seq × seq = 80² = 6400 dot products
Multi-head: seq × seq × H = 80² × 3 = 19200 dot products (3×)

But each multi-head dot product is narrower: its inner dimension is head_dim instead of H × head_dim. When you count actual scalar multiplications, the totals are equal:

Single-head: seq² × (H × head_dim) = 80² × 36 = 230400 scalar multiplications
Multi-head: seq² × H × head_dim = 80² × 3 × 12 = 230400 scalar multiplications
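
The arithmetic above is easy to check in code. The two helper functions below are hypothetical names introduced for illustration; they count only the multiplications in the scoring step S = Kᵀ × Q, using the post's numbers (seq = 80, H = 3, head_dim = 12).

```python
def single_head_cost(seq, H, head_dim):
    """Return (dot products, scalar multiplications) for one full-width head."""
    dots = seq * seq                # one score per (key, query) pair
    mults = dots * (H * head_dim)   # each dot product spans the full width
    return dots, mults


def multi_head_cost(seq, H, head_dim):
    """Return (dot products, scalar multiplications) summed over H heads."""
    dots = seq * seq * H            # one score matrix per head
    mults = dots * head_dim         # each dot product spans only head_dim
    return dots, mults


seq, H, head_dim = 80, 3, 12
print(single_head_cost(seq, H, head_dim))  # (6400, 230400)
print(multi_head_cost(seq, H, head_dim))   # (19200, 230400)
```

The dot-product counts differ by a factor of H, while the multiplication counts match for any choice of sizes, because H × head_dim appears in both products.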

Same total work, split across H specialized questions instead of one broad one.
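
One way to see the split, for a single key and query pair: the 36 multiplications are identical in both cases, and what changes is how the products are grouped. Single-head adds all 36 into one score; multi-head keeps H separate 12-term sums, one score per head. The sketch below assumes each head takes a contiguous slice of the key and query vectors, and uses small random integers as stand-in values (they are not from the post).

```python
import random

H, head_dim = 3, 12
rng = random.Random(0)

# Stand-in key and query columns of full width H * head_dim = 36.
k = [rng.randint(-3, 3) for _ in range(H * head_dim)]
q = [rng.randint(-3, 3) for _ in range(H * head_dim)]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Single-head: one score from the full 36-wide vectors.
single_score = dot(k, q)

# Multi-head: H scores, each from a 12-wide slice of the same vectors.
head_scores = []
for h in range(H):
    lo, hi = h * head_dim, (h + 1) * head_dim
    head_scores.append(dot(k[lo:hi], q[lo:hi]))

# Same products, different grouping: the per-head scores add up to the
# single-head score, but multi-head attention keeps them separate.
print(single_score, head_scores, sum(head_scores) == single_score)
```

In actual multi-head attention the per-head scores are never summed like this; each one feeds its own softmax, which is what lets each head ask its own question.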

Next:
10. Multi-Query Attention

AI by Hand ✍️
