Four ways to shrink a KV cache
A transformer's KV cache is a four-dimensional tensor, and every compression trick — quantisation, eviction, cross-layer sharing, linear attention — attacks one of its axes. Here is the tour, and the cautionary tale of a tiny code model whose accuracy fell 20 points because a smoke test never exercised the one axis that bites.
A small Go code model was about to ship with a compressed KV cache. The recipe stacked two tricks — drop most of the cached tokens, and store the survivors in 4 bits instead of 16 — and on the 15-problem smoke test it was a free lunch: identical pass rate, a cache an order of magnitude smaller. Then the full 164-problem benchmark ran, and the pass rate fell off a cliff: 57.9% to 37.8%, twenty points gone. Nothing about the model had changed. The compression that was free on fifteen problems was very expensive on a hundred and sixty-four.
The interesting part is not that the compression broke. It is which compression broke, and why the smoke test couldn't see it coming. Quantising the cache to 4 bits turned out to be genuinely free. Throwing tokens away was the part that cost twenty points — and only on the problems the smoke test happened not to contain. To see why, you have to look at what the KV cache actually is, because every way of shrinking it is an attack on a different part of the same object.
Pick a method in the panel above. Each one shrinks the cache, but each one shrinks a different axis of it — and that single fact organises the entire field. Quantisation makes each number smaller. Eviction throws away tokens. Cross-layer sharing reuses one layer's cache for its neighbours. Linear attention refuses to keep a per-token cache at all. Same goal, four completely different bargains — and, as the code model found out, four completely different failure modes.
What the cache is, and why it grows
When a transformer generates text, it produces one token at a time, and each new token attends to every token before it. The naive way to do that would recompute the key and value vectors for the whole sequence on every step — quadratic work, most of it repeated. The KV cache is the standard fix: compute each token's key and value vectors once, when the token first appears, and keep them around. Generation then attends against the cache instead of recomputing it. You trade memory for arithmetic, and it is one of the best trades in systems — without it, autoregressive decoding is unusably slow.
The bill arrives as memory. The cache has to hold a key and a value vector for every token, in every attention head, in every layer. Write that down and it is a tensor with a very specific size:
For the model in this story — 16 layers, 3 key/value heads, a head dimension of 64, in fp16 — every term except is fixed the moment you choose the architecture. , , are baked in. The bytes-per-element is 2 because the weights are fp16. Only , the sequence length, moves, and it moves linearly and without bound as the model generates. A long conversation, a long document, a long chain of reasoning — the cache grows one slab per token until it dominates the memory budget and caps how many requests you can batch. That linear-in- growth is the whole problem, and it is exactly what the compression methods attack.
Five dimensions, four of them already spent
Look back at equation (1) and notice how little room there is. and are architectural givens. The factor of 2 for K and V is non-negotiable. The head count has already been attacked before inference even starts: this model uses grouped-query attention, where 12 query heads share just 3 key/value heads — a 4× cut, applied at training time, that is why is 3 and not 12. Multi-query attention pushes that to a single shared head. By the time you are choosing an inference-time trick, the heads axis is spent. What is left to attack is the precision , the length , and the layer count — which is exactly the menu.
The precision axis: TurboQuant
The least invasive thing you can do is keep every token and just store each number in fewer bits. fp16 spends 16 bits on each entry of the cache; most of that precision is wasted, because the values in a small group cluster tightly and a coarse grid would land close enough. Quantisation replaces each number with an index into evenly-spaced levels spanning the group's range, plus a per-group scale to reconstruct it.
Drag the bit selector. At 8 bits the 256 levels sit so close to the originals that the error is invisible. At 4 bits — only 16 levels — you would expect trouble, and you get it if one value is an outlier: a single large number stretches the range, the levels spread out, and every small value quantises badly. That is what the rotate toggle fixes. Multiplying the group by a random orthogonal matrix spreads one value's magnitude across all of them without changing the vector's length, so the range shrinks and the same 16 levels land close to everything. Rotate, then quantise per group — that is the core idea behind TurboQuant.1
The memory math is direct. Going from 16 bits to 4 is a 4× cut on the codes themselves; the only overhead is a per-group scale and zero-point, shared across a group of 32 values, which adds about one bit per value:
And the accuracy cost? On the full benchmark, 4-bit quantisation alone scored 58.5% against the baseline's 57.9% — 96 problems solved versus 95. That is not a real improvement; it is one problem out of 164, comfortably inside run-to-run noise. But it is decisively not a regression. Quantising the cache to a quarter of its size changed nothing the benchmark could measure.
Lossless, in the only sense that matters
"Lossless" here does not mean bit-identical — the reconstruction error in the panel above is real. It means behaviourally lossless: the model solves the same problems. When a 4× memory cut moves your score by a single problem out of 164, the honest reading is that the precision was never load-bearing. Most of those 16 bits were storing noise. This is the comfortable result, the one you hope for — and it makes the contrast with the next axis all the sharper.
The sequence axis: eviction, and the attention sink
The length axis is more tempting and more dangerous. is the term that grows without bound, so capping it caps the cache. Eviction does exactly that: keep a fixed budget of tokens and throw the rest away. The question is which tokens.
The naive answer — keep the most recent ones, slide a window — fails badly, and the reason is one of the more surprising findings in the field. Language models dump a huge fraction of their attention onto the first few tokens of the sequence, almost regardless of content. These attention sinks act as a kind of no-op bucket the softmax can park probability in when no real token deserves it; evict them and the attention distribution destabilises and generation falls apart. StreamingLLM's recipe follows directly: keep a handful of sink tokens at the very start, plus a sliding window of recent tokens, and drop the middle.2
The widget shows the bargain. Four sink tokens are pinned at the head; a recent window of budget − 4 tokens rides the generation frontier; everything between them is evicted. H2O refines this by also pinning a few "heavy hitter" tokens from the middle — the ones with the highest accumulated attention — but the budget is the same.3 Press generate and watch the window slide. As long as the whole sequence fits inside the budget, nothing is lost. But push the prompt length up or let generation run, and the moment the context exceeds the budget, the recent window's left edge marches past the end of the prompt. The prompt body falls into the evicted region. The model is now generating code for a problem it can no longer see.
That is the failure, stated precisely. The cache keeps the sinks and the recent window, so the entire prompt survives only while
For a short prompt with a short answer, that inequality holds for the whole run and eviction is free. For a real coding problem — a docstring, a signature, some imports, then a hundred-token function body to generate — it stops holding partway through, and the prompt scrolls out of memory mid-generation. The budget, not the policy, is what decides.
The collapse, and the attribution
So the compound recipe — eviction at a 256-token budget, plus 4-bit quantisation — lost twenty points on the full benchmark. The natural suspect is the 4-bit quantisation; aggressive low-precision feels like the risky move. The way to find out is to take the compound apart and run each piece alone.
Read down the evict @ 256 group. Streaming at budget 256, H2O at budget 256, and streaming-plus-4-bit-quant all land on exactly 37.8% — 62 problems out of 164, to the problem. Quantisation adds nothing to the drop (it scores 58.5% on its own). A smarter eviction policy adds nothing either (H2O matches streaming exactly). The only thing those three configs share is the budget. Now read the evict @ 512 group: doubling the budget to 512 recovers the full baseline, 57.9%, and the recommended streaming-512-plus-4-bit config sits at 59.1% — within noise of baseline while shrinking the cache to roughly a fifth.
Three different methods landing on exactly the same wrong number is not a coincidence. It is the signature of a cause they all share — and the only thing they shared was a budget smaller than the prompts.
This is the clean part of the story. A controlled ablation, three independent knobs, and the degradation tracks exactly one of them. The twenty-point drop was never about precision. It was about a budget that was smaller than the problems, evicting the problem statement before the model finished answering it.
Why the smoke test lied
Which leaves the real question: how did fifteen problems show this as a free lunch? The answer is in the one thing the smoke test never did — it never made the cache overflow.
The smoke set's prompts top out at 48 tokens. Sinks plus prompt plus a 64-token answer is barely over a hundred — nowhere near the 256-token budget, so eviction never triggered. The trim function looked at every smoke problem, saw it fit, and kept everything. The 15-problem benchmark wasn't measuring lossy compression that happened to be harmless; it was measuring a mechanism that never ran. A green light from a test that doesn't exercise the thing under test is not evidence. It is the absence of evidence wearing evidence's clothes.
Drag the generated so far slider and the trap becomes visible. At prefill — nothing generated yet — only 4 of the 164 full-set prompts even exceed 256 tokens, which is why this is so easy to miss by eyeballing prompt lengths. But the cache doesn't stop growing at prefill. Every generated token pushes the effective context right, and as the slider climbs, problem after problem crosses the budget line and starts shedding its prompt. The smoke set, meanwhile, never moves off zero. It cannot cross a line it starts a quarter-mile short of.
The test that passes for the wrong reason
The dangerous failure is not a test that goes red. It is a test that goes green because it never reached the code path that would have failed. Aggressive eviction looked lossless on the smoke set for the same reason a fire alarm looks functional when there is no fire. If you are validating a mechanism, your test set has to trigger it — here, that means prompts long enough, and generations long enough, to overflow the budget. Coverage is not how many cases you run; it is how many of the failure modes your cases can actually reach.
The other two axes, and why they need training
Quantisation and eviction are inference-time tricks: bolt them onto a finished model and they work (or, in eviction's case, work until the budget bites). The remaining two axes are not so polite.
Cross-layer sharing attacks the layer count . Adjacent transformer layers compute surprisingly similar keys and values, so you can compute K and V once and reuse them across a group of layers — halve the layers you store, halve the cache. But "surprisingly similar" is a property a model has to learn. Bolted onto this already-trained checkpoint with no retraining, cross-layer sharing collapsed generation to 0%; even several rounds of fine-tuning didn't recover it. The redundancy CLA exploits has to be trained in from the start, not assumed after the fact.
Linear attention is the most radical: it attacks by abolishing it. Instead of caching one entry per token, it folds the whole past into a fixed-size recurrent state, so memory stops growing with sequence length entirely — the real prize for very long contexts. The catch is the same, only harder: a linear-attention layer is a different computation, and loading attention-trained weights into it leaves the layer effectively random. Zero-shot, it scores 0%. It is not a compression you apply; it is a model you train.
The pattern across the axes
There is a clean gradient here. Attack precision (quantise) and the model doesn't notice — the bits were slack. Attack the sequence (evict) and it's free until you cross the context the task actually needs, then it falls off a cliff. Attack layers or replace attention, and you are changing what the model computes, so the model has to be trained for it or it produces nothing. The further the axis is from "redundant storage" and the closer to "load-bearing computation," the more the method costs — and the more a too-easy benchmark will hide that cost.
What to actually ship
The fix was not to abandon compression — it was to use the axis that was free and size the dangerous one to the task. Quantise to 4 bits, because precision was never load-bearing; evict, but with a budget large enough that the running context fits inside it. For this benchmark, a 512-token budget plus 4-bit quantisation matched the uncompressed baseline (97 problems versus 95 — noise) while cutting the cache to roughly a fifth.
# the recipe that survives the full benchmark, not just the smoke set
use_kv_eviction: true
kv_eviction_mode: streaming # 4 attention sinks + recent window
kv_eviction_budget: 512 # ≥ prompt + generation for this task
use_turboquant: true
turboquant_key_bits: 4 # quantisation was the free part all along
turboquant_val_bits: 4How to read a 'free' compression result
Before you believe a compression number, ask three things. Did the test trigger the mechanism? Eviction that never overflows is not being tested. Can you attribute the result? Run each knob alone; if a compound moves and you don't know which part moved it, you know nothing. Is the win inside the noise? A one- or two-problem swing out of 164 is a wobble, not a gain — so report compression as "matched the baseline at a fifth the size," never as "improved accuracy." Get those three right and the KV cache is one of the most compressible things in the whole inference stack. Get the first one wrong and you ship a model that forgets the question.
The KV cache is a tensor with five axes, four of them effectively fixed before you start. The art of shrinking it is knowing which of the remaining ones you are allowed to touch for free, which one will bite the moment your real workload is longer than your test, and which ones aren't compression at all but a different model wearing the same weights. The tiny Go model got there in the end — but only because a hundred and sixty-four problems told it what fifteen never could.
Reading further
- Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2023 (ICLR 2024). The attention-sink finding and the sink-plus-recent-window recipe that "streaming" eviction implements.
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference — Zhang et al., NeurIPS 2023. Eviction by accumulated attention; the "heavy hitter" tokens worth keeping from the middle.
- SnapKV: LLM Knows What You're Looking For Before Generation — Li et al., NeurIPS 2024. Prompt-side eviction that picks important tokens using an observation window at the end of the prompt.
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Zandieh et al., 2025. The rotate-then-quantise vector-quantisation method; KV-cache quantisation is one of its applications. (The experiment here uses a deliberately simplified, uniform-quantisation variant.)
- Evaluating Large Language Models Trained on Code — Chen et al., 2021. Introduces Codex, HumanEval, and the pass@k metric used throughout.
- CodeGeeX: Multilingual Benchmarking on HumanEval-X — Zheng et al., KDD 2023. The 164-problem HumanEval set hand-ported to Go and four other languages.
Footnotes
-
TurboQuant (Zandieh et al., 2025) is a general near-optimal vector-quantisation algorithm; KV-cache quantisation is one application of it. The model here implements a simplified variant — random orthogonal rotation plus per-group uniform quantisation, with the scales kept in full precision — not the full method. ↩
-
The "sink" framing comes from StreamingLLM (Xiao et al., 2023). The configuration here keeps 4 sink tokens and a recent window of
budget − 4. ↩ -
H2O (Zhang et al., 2023) is an eviction policy — it chooses which token entries to keep — not a quantisation method. That streaming and H2O land on the identical score at budget 256 is precisely what isolates the budget, rather than the policy, as the cause. ↩
Try it in the lab
All effects →Self-Attention
aiMulti-head self-attention as a live particle network — query tokens cycle, heads drift, weights flow.
attentiontransformerdeep-learningA* Pathfinder
aiA*, Dijkstra, and greedy best-first search — the heuristic pulling the frontier toward the goal.
searchgraphsa-starGradient Descent
aiSGD, Momentum, RMSProp, and Adam racing down a loss landscape — ravines, saddles, and local minima.
optimizationdeep-learningtraining
More from the blog
A* Search, Visually: the Heuristic Is the Whole Game
A* is not a clever algorithm so much as Dijkstra plus a bet about the future. The same code becomes Dijkstra, greedy best-first, or A* depending on one term in the priority key — and admissibility is the single property that buys optimality.
B-Trees vs LSM-Trees: The Two Religions of On-Disk Data
Every database you use bets on one of two storage engines: B-trees (read-optimised, update-in-place) or LSM-trees (write-optimised, append-and-compact). The choice isn't about speed but about which kind of amplification you're willing to pay.
How Python Dicts Really Work: Open Addressing, Probing, and the djb2 Lie
The dict is the most-used data structure in Python and almost nobody knows it uses open addressing, not chaining — so a single deletion leaves a tombstone, and a simple hash table is really a careful dance between load factor, probe sequences, and cache lines.