What to watch
- The query spotlight moves. Every second or so a new token becomes the query and its outgoing pattern redraws. Notice how each token attends differently depending on what it is.
- Heads specialise. Each head has a slowly drifting "preferred relative offset": head 0 tends to look nearby, while higher heads reach further across the sequence. Turn the Heads control up and you'll see several distinct attention patterns overlaid in different hues, much as the heads of a real multi-head attention layer each compute their own pattern in parallel.
- Temperature sharpens or flattens. Drag it low and each query collapses onto one near-certain key: the arcs thin to a single bright stream, the "greedy" regime where training gradients vanish. Drag it high and the distribution flattens, with every token weighted near-equally, so the signal washes out. The useful middle is where real models live.
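The temperature behaviour described above can be checked numerically. The following is a minimal sketch, assuming the attention weights come from a softmax whose scores are divided by the temperature. The function names and the example scores are illustrative and are not taken from the demo's source.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over raw scores; the temperature divides the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: near 0 for one-hot, log(n) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical compatibility scores for one query against four keys.
scores = [2.0, 1.0, 0.5, -1.0]

sharp = softmax(scores, temperature=0.05)   # low temperature
middle = softmax(scores, temperature=1.0)   # the "useful middle"
flat = softmax(scores, temperature=50.0)    # high temperature

for name, probs in [("sharp", sharp), ("middle", middle), ("flat", flat)]:
    rounded = [round(p, 3) for p in probs]
    print(f"{name:>6}: {rounded}  entropy={entropy(probs):.3f}")
```

At the low setting nearly all the weight lands on the highest-scoring key, and at the high setting the four weights are almost equal. Entropy is used here only as a convenient single number for how spread out each distribution is.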
The mechanism
For the active query $i$ and a candidate key $j$, a head computes a compatibility score $s_{ij}$ and normalises over all keys, with the temperature $T$ scaling the scores:
$$\alpha_{ij} = \frac{\exp(s_{ij}/T)}{\sum_{k} \exp(s_{ik}/T)}$$
The arcs above draw $\alpha_{ij}$ directly. Add more Tokens to grow the sequence; add more Heads to stack independent attention patterns. The whole picture is the contents of a single transformer block's attention layer, running forever.
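To make the "score, then normalise over all keys" step concrete for several heads at once, here is a small sketch. The scoring rule is an assumption made for illustration: each head scores a key by how close it sits to that head's preferred relative offset. The page does not state the demo's actual score function, and `attention_row`, `head_offsets`, and the chosen sizes are hypothetical.

```python
import math


def attention_row(query_index, n_tokens, preferred_offset, temperature=1.0):
    """Attention weights from one query to every key, for a single head."""
    # Assumed score: 0 when key j is exactly `preferred_offset` positions
    # away from the query, and more negative the further it is from that.
    scores = [-abs((j - query_index) - preferred_offset) for j in range(n_tokens)]
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


n_tokens = 8
query_index = 3
head_offsets = [0, 2, 4]  # hypothetical: head 0 looks nearby, later heads reach further

# One independent weight distribution per head, all for the same query.
rows = [attention_row(query_index, n_tokens, off) for off in head_offsets]
peaks = [row.index(max(row)) for row in rows]

for head, (row, peak) in enumerate(zip(rows, peaks)):
    print(f"head {head}: strongest key = {peak}, weights = {[round(w, 2) for w in row]}")
```

Each row is a separate probability distribution over the keys, which is what lets the heads be drawn as independent overlaid patterns. A trained model would replace the hand-written score with one learned from the token representations.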