Attention, From the Inside Out
Attention is just a weighted average whose weights the data computes by asking itself questions. A worked tour through scaled dot-product attention, temperature and sampling, and what a 46B-active / 1T-total Mixture-of-Experts spec actually means — with live matrices you can poke.
Every explanation of the transformer eventually arrives at the same sentence: attention lets each token look at every other token. That sentence is true, and it is almost entirely useless. It tells you attention permits looking, but not what looking is, how hard a token looks, or why the mechanism produces the behaviour everyone is so excited about.
So here is a better one-liner:
The whole mechanism, in one breath
Attention is a weighted average. Each output token is a convex combination of the input tokens' value vectors. What the network learns is how to produce the weights, and it produces them by having the data ask itself a question: for the thing I am, which of these other things is most relevant to me?
That's it. No recurrence, no convolution, no memory cell — just a weighted average where the weights are a function of the data. The rest of this post is the arithmetic that turns that sentence into the matrix below.
The matrix computes, for the toy sentence "the cat sat because it napped", how strongly each token (a query) weighs every other token (a key) into its output. Hover any token on the left and watch its attention weights draw in as arcs — their thickness is the weight. The default lands on "it", and the headline result: it resolves to cat. A pronoun found its antecedent, with nothing more than a row of dot products and a softmax.
The soft lookup
The right mental model is a dictionary. A normal dictionary is a hard lookup: you give a key, you get exactly one value. {"cat": "feline animal"} — the key matches or it doesn't, and there is no middle ground.
Attention is a soft dictionary. You provide a query, and instead of returning one value, it returns a blend of all the values, weighted by how well each key matches the query. Close matches contribute a lot; poor matches contribute almost nothing. The blend is continuous, differentiable, and — crucially — learnable, because the weights are a smooth function of the inputs.
Three roles, then, projected out of every token's embedding:
- a query — "what am I looking for?"
- a key — "what do I have to offer?"
- a value — "if you pick me, here is what I contribute."
A token's query is dotted against every key. Big dot product means relevant; that key's value gets a large weight in the output. Small dot product means irrelevant; its value is all but ignored. The query and the key never need to be equal — they just need to be aligned in the space the projection learns.
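To make the soft lookup concrete, here is a minimal sketch in plain Python (no libraries, so the arithmetic stays visible). The keys, values, and query are made-up toy numbers, and `soft_lookup` is a hypothetical helper name rather than anything from a real library.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(query, keys, values):
    """Return the match weights and the weighted blend of all values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    blend = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blend

# Toy data (assumed for illustration): three entries with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
query = [3.0, 0.0]  # aligned with the first key, not equal to it

weights, blend = soft_lookup(query, keys, values)
print(weights)  # the first weight dominates, the others are small but nonzero
print(blend)    # close to the first value, nudged by the rest
```

Unlike a hard dictionary, a query that matches nothing exactly still returns something sensible: a blend tilted toward whichever keys it is most aligned with.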
The three projections
$Q$, $K$, and $V$ are not given. Each is a learned linear projection of the same token embedding $X$:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$
Three weight matrices, $W_Q$, $W_K$, $W_V$, map the shared embedding into three different roles. This is the entire reason attention is expressive: the same token can ask a sharp question (via $W_Q$), present a specific offer (via $W_K$), and carry distinct content (via $W_V$). A token is not one vector; it is three, wearing different hats.
Why three projections, not one
You could set $W_Q = W_K = W_V$ and attention would still run. But you'd have thrown away the model's ability to separate "what I want" from "what I am." The pronoun it wants to find a noun; a noun wants to be found. Those are different jobs, and they need different projections to do them well. The three matrices give the network the latitude to learn that distinction.
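As a rough sketch of what "three projections of the same embedding" means in arithmetic, the snippet below multiplies one small embedding matrix by three different weight matrices. All numbers are invented for illustration (in a real model the weights are learned), and `matmul` is a hand-rolled helper.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Assumed toy sizes: 2 tokens, embedding size 3, projected down to size 2.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

W_Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # "what am I looking for?"
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]  # "what do I have to offer?"
W_V = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]  # "what do I contribute?"

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

print(Q)  # [[1.0, 0.0], [0.0, 1.0]]
print(K)  # [[0.0, 1.0], [1.0, 0.0]]
print(V)  # [[2.0, 2.0], [1.0, 1.0]]
```

Same two tokens, three different views: the query and key of token 0 point in different directions, which is exactly the freedom a single shared projection would remove.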
The equation
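Everything above collapses into a single line, the whole of scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \tag{1}$$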
Read it inside-out, in three steps:
- $QK^\top$: the compatibility matrix. Row $i$ holds the dot products of query $i$ against every key. It is the raw, unscaled answer to "token $i$, how relevant is each other token?"
- $\mathrm{softmax}(QK^\top / \sqrt{d_k})$: turn each row of raw scores into a probability distribution. This is where the weights come from. Each row sums to $1$.
- $\mathrm{softmax}(\cdot)\,V$: use those weights to take a convex combination of the values. The result is the output token: a weighted average of the input tokens' values, weighted by relevance.
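The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under toy assumptions (tiny hand-written matrices, list-of-rows storage, helper names of my own choosing), not how a production kernel is written.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to every row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                 # step 1: compatibility
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # the sqrt(d_k) scale
    weights = softmax_rows(scaled)                                   # step 2: rows sum to 1
    return weights, matmul(weights, V)                               # step 3: blend the values

# Toy inputs, assumed for illustration only.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

weights, output = attention(Q, K, V)
print(weights)
print(output)
```

Because each row of `weights` is a probability distribution, every output row lands somewhere between the value rows it was blended from; it can never leave their convex hull.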
The softmax is the heart of it, so it deserves its own dissection.
The matrix, and why we scale
The softmax does two jobs at once: it makes every weight positive (via $\exp$), and it makes the row sum to one (via the normalising denominator). Watch one row of the matrix walk through both steps:
The first stage is the raw scores — signed, unbounded, and not yet a weighting. After exponentiation they're all positive, but not yet comparable (one could be , another ). Dividing by the sum — the partition function — is what turns them into a distribution. The row now reads like a probability table over keys.
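Here is the same walk through the stages as a small sketch, with one row of made-up scores. The stage names are mine; the last few lines also show the usual stability trick of subtracting the row maximum first, which leaves the result unchanged.

```python
import math

raw = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores: signed and unbounded

# Stage 1: exponentiate. Everything becomes positive, order is preserved.
positive = [math.exp(s) for s in raw]

# Stage 2: divide by the sum (the partition function).
partition = sum(positive)
weights = [p / partition for p in positive]

# Same answer if we shift the scores first, which avoids overflow on big inputs.
shift = max(raw)
shifted = [math.exp(s - shift) for s in raw]
stable = [e / sum(shifted) for e in shifted]

print(positive)
print(weights)
print(stable)
```

The shift works because multiplying numerator and denominator by the same constant cancels, so softmax only ever cares about differences between scores.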
The scaling by $1/\sqrt{d_k}$ is not decoration. It is the difference between a model that trains and one that doesn't.
Drop the √dₖ and your gradients vanish
Without the scale factor, the dot products grow with the dimension $d_k$. Each one is a sum of $d_k$ random-ish terms, so if the components have roughly unit variance, the variance of the sum grows like $d_k$ and its typical magnitude like $\sqrt{d_k}$. As the scores grow, the softmax saturates: the largest score dominates, pushes everything else toward zero, and the distribution collapses to a near one-hot. A saturated softmax has gradients near zero in exactly the region you're trying to train, so learning stalls. Dividing by $\sqrt{d_k}$ keeps the scores' variance roughly independent of dimension, which keeps the softmax in its useful, well-gradiented middle range.
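A quick numerical check of that claim, as a sketch. It assumes the query and key components are independent draws from a unit Gaussian, which is an idealisation of real activations; `dot_product_variance` is a hypothetical helper written for this post.

```python
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k for random unit-variance q and k of size d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    scaled = raw / d_k  # dividing each score by sqrt(d_k) divides the variance by d_k
    print(f"d_k={d_k:3d}  raw variance={raw:7.2f}  after scaling={scaled:5.2f}")
```

The raw variance should track $d_k$ while the scaled one hovers near 1, which is the whole job of the scale factor.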
Drag the slider in the matrix down to to see the collapse — "it" snaps to a near-100% pick of cat and the rest go dark. Drag it up and the distribution flattens toward uniform, washing the signal out. The default sits at the sweet spot.
This is the single most common point of confusion in attention, so it is worth the emphasis: the scale factor exists to manage the softmax's input range, not the math's correctness. The unscaled equation is still a valid weighted average; it just happens to produce weights that train terribly.
Many heads, many questions
A single attention head asks one question. "Given my query, which keys match?" But a token is usually doing several things at once: it wants to resolve its antecedent, agree on number with its verb, and inherit tense from the clause it lives in. One set of weights cannot do all of that simultaneously.
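The fix is to run several attention heads in parallel, each with its own learned projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$, concatenate their outputs, and mix them back with a final projection $W_O$:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)}\right)$$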
Each head learns to attend to a different kind of relationship. Probe a trained model and you find heads that specialise: some track subject–verb agreement, some look for the previous occurrence of the same token, some attend to closing brackets and quotation marks. None of this is hand-designed — it emerges from training, because the data rewards heads that carve out useful questions.
Heads are an ensemble, not a committee
A common mental trap is to think of multi-head attention as "voting." It isn't. The heads don't agree or compromise; they run independently and their outputs are concatenated. Each head adds a different view of the same tokens to the representation, and the downstream layers learn to use whichever views are useful. Eight heads is not eight opinions averaged down — it is eight features stacked side by side.
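A minimal sketch of "stacked side by side", in plain Python. It assumes the per-head projections have already been applied and packed as column slices of `Q`, `K`, and `V` (slicing one wide projection is equivalent to using separate per-head matrices), and it leaves out the final mixing projection to keep the concatenation visible. The data and helper names are invented.

```python
import math

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return out

def split_heads(M, n_heads):
    """Cut each row of M into n_heads equal slices, one matrix per head."""
    d_head = len(M[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in M] for h in range(n_heads)]

def multi_head(Q, K, V, n_heads):
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(Q, n_heads), split_heads(K, n_heads), split_heads(V, n_heads))
    ]
    # Concatenate: row i of the result is row i of every head, side by side.
    return [[x for head in heads for x in head[i]] for i in range(len(Q))]

# Toy input (assumed): 3 tokens, model width 4, reused for Q, K, and V.
X = [[1.0, 0.0, 0.5, 2.0],
     [0.0, 1.0, 1.5, 0.0],
     [1.0, 1.0, 0.0, 1.0]]

out = multi_head(X, X, X, n_heads=2)
print(out)
```

Nothing is averaged across heads: the first two columns of `out` come only from head 0 and the last two only from head 1.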
The mask
There is one more detail, and it is the difference between a model that understands text and one that merely sees it lying around in a bag. In a decoder (GPT and friends), token $i$ is not allowed to look at any later token $j > i$. If it could, it would be reading the answer while predicting it: a spectacularly easy game to win and a useless thing to have learned.
The fix is causal masking: before the softmax, set every entry above the diagonal of $QK^\top$ to $-\infty$. After the softmax those positions become exactly zero weight. Token $i$ can attend to tokens $1, \dots, i$, but never to its own future.
The mask is why an autoregressive model's attention matrix is lower-triangular: a staircase of allowed connections, with the future masked into silence. The matrix itself is why these models get expensive on long sequences: every new token adds a row and a column, so the matrix grows quadratically with sequence length. FlashAttention and its descendants are, at their core, an accounting trick for never materialising that matrix in full; they compute it tile by tile instead.
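A small sketch of the masking step, assuming a square matrix of already-scaled scores stored as a list of rows. `causal_softmax` is a name made up here; the point is only that a score of negative infinity turns into a weight of exactly zero.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i may only see columns 0..i."""
    out = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because the diagonal entry is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With all scores equal, each token spreads its weight evenly over its past.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]

for row in causal_softmax(scores):
    print(row)
```

The printed rows form the staircase: the first token can only attend to itself, the second splits its weight two ways, the third three ways.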
Turning weights into tokens
Everything so far has produced, for each position, a rich vector representing the token in context. The very last thing a generative model does is turn that vector into an actual next token — and it does so with the softmax you already met, applied to a vector of logits (one score per word in the vocabulary).
This is where a knob called temperature shows up, and it is worth pausing on, because it is exactly the same idea as the scaling inside attention. With logits $z$ and temperature $T$, the next-token distribution is
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Temperature divides the logits before the softmax. Push it toward zero and the distribution collapses to a near one-hot pick — the model becomes greedy, always choosing its single favourite, which is coherent but repetitive. Push it high and the distribution flattens toward uniform — the model rambles, picking unlikely words. The same knob, the same mechanism, the same trade-off between "too sharp to learn/train" and "too flat to be useful."
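A minimal sketch of the knob, with a handful of invented logits. The function name and the numbers are assumptions for illustration; the only real content is "divide by the temperature, then softmax".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four candidate tokens

for temperature in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperature puts nearly all the mass on the top logit, high temperature pushes every token toward an equal share, and the ranking of the tokens never changes in between.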
Drag the temperature down and watch the distribution sharpen onto stage; drag it up and the long shots (drums, piano) swell. Hit sample a few times: at low temperature you'll draw stage every time; at high temperature the draws scatter across the vocabulary. The two other knobs, top-k and top-p, truncate the distribution before the draw: top-k keeps only the $k$ most likely tokens, top-p (nucleus) keeps the smallest set whose cumulative probability reaches $p$. Both zero out the tail so the model can never sample a one-in-a-million word even at high temperature.
Why top-p is usually better than top-k
Top-k keeps a fixed number of candidates regardless of confidence. That's awkward: on "The capital of France is" the model is nearly certain and one candidate suffices, while on "The most interesting thing about" it is genuinely uncertain and twenty might be reasonable. Top-p adapts — it keeps as many candidates as the model's actual confidence warrants, few when the model is sure, more when it isn't. In practice, temperature + top-p is the combination most production decoders settle on.
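The difference shows up clearly on two made-up distributions, one confident and one uncertain. This is a sketch with hypothetical helper names and probabilities, not any particular decoder's implementation.

```python
import random

def renormalise(probs, keep):
    """Zero out everything not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest set whose mass reaches p
            break
    return renormalise(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

confident = [0.97, 0.01, 0.01, 0.01]       # assumed: the model is nearly certain
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # assumed: the model is unsure

def count(dist):
    return sum(1 for p in dist if p > 0.0)

print(count(top_k_filter(confident, 2)), count(top_k_filter(uncertain, 2)))      # 2 2
print(count(top_p_filter(confident, 0.8)), count(top_p_filter(uncertain, 0.8)))  # 1 4

rng = random.Random(0)
print(sample(top_p_filter(confident, 0.8), rng))  # 0
```

Top-k hands back two candidates in both cases; top-p hands back one when the model is sure and four when it is not.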
Where the parameters live: 46B active out of 1T
A model described as "46B active, 1T total" is a Mixture-of-Experts (MoE) model, and the two numbers measure different things. The total counts every weight the model owns; the active counts only the weights a single token actually multiplies on its way through. The gap is the whole point of the architecture.
To see where the gap comes from, look at one transformer block. It has two halves: the attention you've spent this post inside, and a feed-forward network (FFN), a two-layer MLP that does the bulk of the model's "knowledge" storage. In a dense model, every token runs through the same single FFN. In an MoE, that single FFN is replaced by $N$ parallel experts (each its own FFN), plus a tiny router that looks at the token and picks which experts should handle it.
Each token lights only 2 of 64 experts here. The dormant 62 still occupy memory — the full ≈1T of weights must live in VRAM whether they fire or not — but they do no multiply-accumulates. That is how you buy ≈1T of representational capacity for ≈46B of per-token compute: the sparsity is the bargain. Real models push it harder than this illustration; DeepSeek-V3 routes 8 of 256 experts (≈37B active of 671B), and GLM/Kimi-class models land in the same single-digit-percent range.
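The routing step can be sketched in a few lines. Everything here is a toy assumption: the experts are scalar functions rather than FFNs, the router logits are written by hand rather than produced by a learned layer, and the gate is a softmax over the chosen experts only (real models differ on exactly how they normalise the gates).

```python
import math

def route(router_logits, top_k=2):
    """Pick the top_k experts and softmax their logits into gate weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:top_k]
    m = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - m) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}

def moe_forward(x, experts, router_logits, top_k=2):
    gates = route(router_logits, top_k)
    # Only the chosen experts run; the rest cost memory but no compute.
    return sum(g * experts[e](x) for e, g in gates.items()), gates

calls = []  # records which experts actually did any work

def make_expert(e):
    def expert(x):
        calls.append(e)
        return e * x  # hypothetical stand-in for an FFN
    return expert

experts = [make_expert(e) for e in range(64)]
router_logits = [0.0] * 64
router_logits[5] = 2.0
router_logits[40] = 2.0

y, gates = moe_forward(1.0, experts, router_logits)
print(gates)          # {5: 0.5, 40: 0.5}
print(y)              # 22.5
print(sorted(calls))  # [5, 40]
```

All 64 experts exist in the `experts` list, but only two of them were ever called for this token, which is the 2-of-64 picture in miniature.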
Active compute is cheap; total memory is not
The MoE trade has a sharp edge. You save compute (FLOPs per token scale with the active count) but you do not save memory: every expert's weights must be resident to be available when the router calls on them. A 1T-parameter MoE needs the VRAM to hold 1T of weights even though it only does 46B of work per token. This is why MoE models are served on multi-GPU setups with weight sharding: the bottleneck is fitting the dormant experts in memory, not the math of the active ones. It is also why training them is finicky: left to itself, the router collapses onto a handful of favourite experts and the rest never learn, so most MoE training recipes include some load-balancing mechanism, classically an auxiliary loss that penalises uneven expert usage.
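The arithmetic behind that trade is short enough to write down. This sketch assumes 2 bytes per parameter (16-bit weights) and uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; both are approximations, and `moe_budget` is a hypothetical helper.

```python
def moe_budget(total_params, active_params, bytes_per_param=2):
    """Back-of-the-envelope memory and compute for one MoE model."""
    return {
        "active_fraction": active_params / total_params,
        # Memory follows the total: every expert must be resident.
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
        # Compute follows the active count (rule-of-thumb estimate).
        "approx_flops_per_token": 2 * active_params,
    }

budget = moe_budget(total_params=1e12, active_params=46e9)
for name, value in budget.items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions the weights alone come to roughly 2,000 GB, while each token touches under 5% of them.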
And there is the full picture. The attention block (softmax over $QK^\top / \sqrt{d_k}$) is where tokens look at each other. The FFN (or, in an MoE, the few experts the router selects) is where each token, in light of what it just saw, decides what to become. Stack that block a few dozen times, apply a temperature-scaled softmax over the vocabulary at the end, sample, and you have a language model. Everything else (the engineering, the scale, the training tricks) is in service of those two operations.
Reading further
- Vaswani et al., Attention Is All You Need (2017) — the paper that started it. Short, dense, and still the canonical reference for the equation above. arXiv:1706.03762
- Bahdanau, Cho & Bengio (2014) — the additive attention that preceded the transformer, born from machine translation. Useful for understanding what problem "attention" was invented to solve before it became a generic block. arXiv:1409.0473
- Rush, The Annotated Transformer — the original paper, re-implemented line-by-line in PyTorch. The fastest path from "I've read the equation" to "I can run it." nlp.seas.harvard.edu
- Jay Alammar, The Illustrated Transformer — the canonical visual companion, with the diagrams most people have in their head when they say "attention." jalammar.github.io
- Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer (2017) — where the router/top-k gating idea entered the deep-learning era. The load-balancing loss mentioned in the warning above comes from here. arXiv:1701.06538
- DeepSeek-V3 Technical Report (2024) — a modern, production-scale MoE (671B total, ≈37B active) with the kind of fine-grained expert routing the diagram above simplifies. A good reference for the memory-vs-compute trade in practice. arXiv:2412.19437
- Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — the case for nucleus (top-p) sampling over top-k and greedy decoding. Where the "why top-p" intuition in the info callout comes from. arXiv:1904.09751
The matrix above runs the real arithmetic — hand-tuned scores, exact softmax, the actual scaling — so when you dragged that slider and watched "it" sharpen onto cat, you were watching equation (1) do exactly what it does in a production model. The only difference is scale: a real head does this over thousands of tokens, with weights learned across trillions of them. The mechanism is the one you just held in your hand.