Attention, in seven tokens.
A transformer’s job, in one operation: for each token, look at every other token and decide how much to listen. The mechanism is three lines of math — a dot product, a softmax, a weighted sum — and most of the surprise in modern language models comes from running it in parallel, many times over.
Below, the smallest version that still earns the name. Seven tokens, four-dimensional embeddings, no learned projections. The dot products are easy to read off and the attention pattern is visibly a sentence parsing itself.
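In symbols, the three steps might be written as follows. The notation is an assumption introduced here, not the original's: $x_i$ is the embedding of token $i$, $s_{ij}$ the raw score between query $i$ and key $j$, $a_{ij}$ the attention weight, and $o_i$ the output for token $i$. With no learned projections, the same $x$ plays query, key, and value.

```latex
\begin{align}
s_{ij} &= x_i \cdot x_j \\
a_{ij} &= \frac{\exp(s_{ij})}{\sum_{m} \exp(s_{im})} \\
o_i    &= \sum_{j} a_{ij}\, x_j
\end{align}
```

Standard scaled dot-product attention also divides $s_{ij}$ by $\sqrt{d_k}$ before the softmax. The toy as described does not mention a scaling factor, so none appears here.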
One query, every key.
Click any token below to make it the query. The same vector acts as Q, K, and V in this toy — no learned projections, just a dot product, a softmax, and a weighted sum. Watch which other tokens it pays attention to.
The output is the weighted sum of every value vector, with weights set by how aligned each key is with the query.
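A minimal sketch of that single-query calculation, using only the Python standard library. The token list and the embedding values are made up for illustration and are not the numbers behind the demo. The helper names (`softmax`, `dot`, `attend`) are likewise hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One query against every key: dot product, softmax, weighted sum."""
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical 4-dimensional embeddings (illustrative values only).
tokens = ["the", "cat", "sat"]
embeddings = [
    [1.0, 0.0, 0.0, 0.5],  # the
    [0.0, 1.0, 0.5, 0.0],  # cat
    [0.0, 0.5, 1.0, 0.0],  # sat
]

# As in the toy, the same vectors serve as Q, K, and V. Here "sat" is the query.
weights, output = attend(embeddings[2], embeddings, embeddings)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these invented vectors the query leans hardest on itself, then on the token whose embedding points in the most similar direction, which is the behaviour the demo shows.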
Every query, every key.
The same calculation, run for every query token at once. Each row is one query’s attention distribution — brighter cells are tokens it leaned on. Click a row to pull it into the table above.
sat attends most to sat (25%), cat (18%), on (16%).
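The grid can be sketched the same way: compute one softmax row per query and read off the largest entries. The embeddings are again invented for illustration, so the percentages will not match the demo's. `attention_matrix` and `top_attended` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(embeddings):
    """Row i holds query i's attention distribution over every key."""
    return [
        softmax([sum(q * k for q, k in zip(query, key)) for key in embeddings])
        for query in embeddings
    ]

def top_attended(tokens, row, n=3):
    """Return the n (token, weight) pairs with the highest weight in a row."""
    ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical 4-dimensional embeddings (illustrative values only).
tokens = ["the", "cat", "sat", "on"]
embeddings = [
    [1.0, 0.0, 0.0, 0.5],  # the
    [0.0, 1.0, 0.5, 0.0],  # cat
    [0.0, 0.5, 1.0, 0.0],  # sat
    [0.5, 0.0, 0.5, 1.0],  # on
]

matrix = attention_matrix(embeddings)
for token, row in zip(tokens, matrix):
    summary = ", ".join(f"{t} ({w:.0%})" for t, w in top_attended(tokens, row))
    print(f"{token} attends most to {summary}")
```

Each row is an independent softmax, so every row sums to one while the columns need not.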
Two things are deliberately missing from the toy and worth naming. The first is positional encoding: both occurrences of "the" have identical embeddings, so they attend identically. Real transformers add a position-dependent signal so the model can tell the "the" before "cat" from the "the" before "mat".
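A small sketch of the effect, under two assumptions that are not from the text above: the sentence and embedding values are invented, and the position signal is the sinusoidal scheme, which is one common choice among several. Without it, the two rows for the repeated token come out identical. With it, they diverge.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(query, keys):
    """Attention distribution of one query over all keys."""
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])

def sinusoidal_position(pos, dim):
    """Sinusoidal position signal: sin on even indices, cos on odd ones."""
    signal = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        signal.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return signal

# Hypothetical sentence and embeddings (illustrative values only).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
lookup = {
    "the": [1.0, 0.0, 0.0, 0.5],
    "cat": [0.0, 1.0, 0.5, 0.0],
    "sat": [0.0, 0.5, 1.0, 0.0],
    "on":  [0.5, 0.0, 0.5, 1.0],
    "mat": [0.0, 1.0, 0.0, 0.5],
}

# Without positions, both copies of "the" are the same vector.
plain = [lookup[t] for t in tokens]
# With positions, each embedding is shifted by a signal unique to its index.
positioned = [
    [x + p for x, p in zip(vec, sinusoidal_position(pos, 4))]
    for pos, vec in enumerate(plain)
]

plain_rows = [attention_row(q, plain) for q in plain]
positioned_rows = [attention_row(q, positioned) for q in positioned]

print("identical without positions:", plain_rows[0] == plain_rows[4])
print("identical with positions:   ", positioned_rows[0] == positioned_rows[4])
```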
The second is learned projections. Real attention does not use the embedding directly as Q, K, and V; it first multiplies the embedding by three separate weight matrices, learned during training. Each head in multi-head attention runs the same operation with different matrices, so different heads can specialise on different relationships: subject-verb in one, coreference in another.
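To make the projection step concrete, here is a hedged sketch with two heads. The weight matrices are random stand-ins, since real ones come out of training, and the embeddings are invented. The sketch also divides scores by the square root of the head dimension, as standard scaled dot-product attention does, and it leaves out the output projection that real multi-head attention applies after concatenation. All function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Stand-in for a trained weight matrix (random, not learned)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """Project embeddings to Q, K, V, then run the same three-step attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))  # sqrt(d_k) scaling from the standard formulation
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q times K transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)

# Hypothetical 4-dimensional embeddings (illustrative values only).
x = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.5, 0.0, 0.5, 1.0],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
head_dim = 2
heads = []
for _ in range(2):
    w_q, w_k, w_v = (random_matrix(4, head_dim, rng) for _ in range(3))
    heads.append(attention_head(x, w_q, w_k, w_v))

# Concatenate each token's per-head outputs back into one vector.
combined = [sum((out[i] for _, out in heads), []) for i in range(len(x))]
```

Because each head has its own matrices, the two attention patterns differ even though both heads read the same embeddings.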
What the toy does keep, exactly, is the shape of the operation: similarity by dot product, normalisation by softmax, mixing by weighted sum. Stack that twelve to a hundred times, with feed-forward layers, residual connections, and normalisation in between, and you have a transformer.