
Self-Attention

Name: Self-Attention
Author: Ben Ebsworth

Multi-head self-attention as a live particle network — query tokens cycle, heads drift, weights flow.

What to watch

The query spotlight moves. Every second or so a new token becomes the query and its outgoing pattern redraws. Notice how the pattern changes with the query: which keys receive weight depends on what the query token is.
Heads specialise. Each head has a slowly drifting "preferred relative offset": head 0 tends to look nearby, while higher heads reach further across the sequence. Turn the Heads control up and you'll see several distinct attention patterns overlaid in different hues, much as the heads of a real model each produce their own pattern.
Temperature sharpens or flattens. Drag it low and each query collapses onto one near-certain key: the arcs thin to a single bright stream. This is the "greedy" regime, where the softmax saturates and training gradients vanish. Drag it high and the distribution flattens until every token is weighted near-equally and the signal washes out. The useful middle is where real models live.
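The temperature behaviour can be tried out in a few lines. This is a minimal sketch, not the visualisation's own code: it assumes temperature divides the scores before the softmax, and the function name `softmax_with_temperature` and the score values are made up for illustration.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over `scores` after dividing each score by `temperature`.

    Assumption: temperature is a positive scalar applied before the softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical compatibility scores for one query against four keys.
scores = [2.0, 1.0, 0.5, -1.0]

sharp = softmax_with_temperature(scores, 0.05)   # low temperature
middle = softmax_with_temperature(scores, 1.0)   # unscaled softmax
flat = softmax_with_temperature(scores, 50.0)    # high temperature

for name, weights in [("sharp", sharp), ("middle", middle), ("flat", flat)]:
    print(name, [round(w, 3) for w in weights])
```

At low temperature almost all the weight lands on the highest-scoring key; at high temperature the four weights become nearly equal.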

The mechanism

For the active query $q$ and a candidate key $k$, a head computes a compatibility score and normalises it over all keys:

$$\alpha_{q,k} = \operatorname{softmax}_k\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right)$$
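As a rough sketch of that formula for a single head, the snippet below scores one query against a handful of keys and normalises the result. The vectors are invented for illustration and `attention_weights` is a hypothetical helper name; $d_k$ is taken to be the length of the query vector.

```python
import math

def attention_weights(query, keys):
    """Return softmax(q . k / sqrt(d_k)) over all keys for one query."""
    d_k = len(query)
    scores = [
        sum(qi * ki for qi, ki in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative vectors only: one query and three candidate keys, d_k = 4.
query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],  # aligned with the query
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    [0.5, 0.5, 0.5, 0.5],  # partially aligned
]

alpha = attention_weights(query, keys)
print([round(a, 3) for a in alpha])
```

The key aligned with the query gets the largest weight, the orthogonal key the smallest, and the weights sum to 1.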

The arcs above draw $\alpha$ directly. Add more Tokens to grow the sequence; add more Heads to stack independent attention patterns. The whole picture is the contents of a single transformer block's attention layer, running forever.
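Stacking heads can be sketched the same way. In the toy below each head gets its own query and key projections, and random matrices stand in for learned weights. That is an assumption made purely for illustration, as are the names `multi_head_weights` and `project`; it is not how the visualisation itself generates its patterns.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, matrix):
    """Multiply `vec` by `matrix`, given as a list of rows."""
    return [sum(v * w for v, w in zip(vec, row)) for row in matrix]

def multi_head_weights(tokens, query_index, num_heads, d_k, seed=0):
    """One attention distribution per head for the token at `query_index`."""
    rng = random.Random(seed)
    d_model = len(tokens[0])
    patterns = []
    for _ in range(num_heads):
        # Random stand-ins for a head's learned query and key projections.
        w_q = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_k)]
        w_k = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_k)]
        q = project(tokens[query_index], w_q)
        ks = [project(t, w_k) for t in tokens]
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in ks
        ]
        patterns.append(softmax(scores))
    return patterns

# Five made-up token embeddings with d_model = 8.
token_rng = random.Random(1)
tokens = [[token_rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]

patterns = multi_head_weights(tokens, query_index=1, num_heads=3, d_k=4)
for head, weights in enumerate(patterns):
    print(head, [round(w, 3) for w in weights])
```

Each row is one head's distribution over the same five tokens. Because the projections differ, the heads generally disagree about where the query should look.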
