Repair is just solving in disguise
I gave a 160M-parameter Go model a reflect-and-fix loop so it could bootstrap past its own ceiling. It fixed zero bugs. The reason runs deeper than the code — and it taught me my model had been memorising all along.
Here is a claim I was sure I could make true: a small code model that reads its own failing test output, reflects on what went wrong, and rewrites the code will beat the same model writing code in one shot. Give it a feedback loop and it climbs. That is the whole premise of a self-improving system.
So I built one. A 160M-parameter Go model drafts a function, I run go test, and if the tests fail the model gets the error and produces a reflection plus a revision. Train that critic on enough verified examples of "here's a broken attempt, here's the fix", and it should learn to repair.
It fixed zero bugs. Not "a disappointing one or two" — zero, across every variant I tried. And chasing why turned into the most useful week I've spent on this project, because the answer wasn't a hyperparameter. It was a fact about what a small model can and can't be taught.
The setup
Tekhne is my standing experiment in capability density: how much can you wring out of a deliberately tiny model? The base here is a 160M Go coder, SFT'd on HumanEval-X Go. On the full 164-problem benchmark it sits at 57.9% pass@1, which for a model this size is already punching above its weight.
The reflective loop, which I'd been calling a Reflective Code Transformer, works like this. Start from the Go checkpoint, bolt on two extra transformer layers (16 → 18), freeze the original 16, and train only the new layers plus the head on transcripts shaped as attempt → trace → reflection → revision. At inference: draft, execute, and if the tests fail, feed the failure back and let the new layers emit a fix.
- Base model: 160M (Go SFT checkpoint)
- HumanEval-X Go: 57.9% (pass@1, full 164 problems)
- Critic layers: +2 (16 frozen, 2 trained)
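The control flow of the draft, execute, revise loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual harness: `RunResult`, `draft_execute_revise`, and the three callables (`draft`, `run_tests`, `revise`) are hypothetical names standing in for the base model, the `go test` runner, and the critic layers.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunResult:
    passed: bool
    output: str  # compiler or test output, fed back to the critic on failure


def draft_execute_revise(
    prompt: str,
    draft: Callable[[str], str],               # hypothetical: base model writes code
    run_tests: Callable[[str], RunResult],     # hypothetical: wraps the test runner
    revise: Callable[[str, str, str], str],    # hypothetical: critic rewrites code
    max_revisions: int = 1,
) -> Tuple[str, bool]:
    """Return (final_code, passed) after at most `max_revisions` repair attempts."""
    code = draft(prompt)
    result = run_tests(code)
    for _ in range(max_revisions):
        if result.passed:
            break
        # The critic sees the prompt, the failed attempt, and the real error text.
        code = revise(prompt, code, result.output)
        result = run_tests(code)
    return code, result.passed
```

Written this way, the loop makes the later argument visible: `revise` is just another sampling call, conditioned on the failure.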
For training data I needed verified repair traces: a failing attempt, the real error, and a revision that actually compiles and passes. I sampled drafts from the base model, kept the ones that failed, and used Gemini 2.5 Flash through the Vertex API as the teacher to write the reflection and the canonical fix. (The teacher hopping from Kimi to the gemini CLI to raw Vertex REST is its own saga of free tiers being quietly strangled. Another post.)
That gave me 497 verified transcripts: 437 in-distribution HumanEval-X Go failures with canonical-solution revisions, plus 60 cross-domain MBPP repairs for variety. Seven times more data than my first attempt, and the same benchmark family as the eval. Everything pointed the right way.
The result that wouldn't move
I trained the critic on all 497 traces, watched the validation loss fall to a healthy 0.90, and ran it on a held-out 30-problem slice.
- Base pass@1: 73.3% (22 / 30, held-out)
- Critic pass@1: 73.3% (22 / 30, 0 fixed)
- Loop pass@1: 73.3% (no movement at all)
Twenty-two out of thirty, before and after. The critic fixed none of the eight failures. And this was with more data, cleaner training, and a much better loss than the run before it.
Now, an earlier version of this experiment had given me a lovely "+3.4 points, 76.7%" result that I'd half-believed. So before concluding anything, I went back to check whether that number was real. It wasn't. It was one problem out of thirty flipping, measured against a weak 43.3% base from an undertrained checkpoint. A 1/30 artifact dressed up as a trend.
Before you call a change an improvement, ask how many problems actually moved. "+3.4 points" on a 30-problem set is exactly one problem. Against a noisy baseline, that's not a result — it's a coin landing the way you hoped.
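The arithmetic behind that check is small enough to keep next to the eval script. A minimal sketch, with hypothetical helper names:

```python
def points_per_problem(n_problems: int) -> float:
    """Percentage points that a single problem is worth on an n-problem set."""
    return 100.0 / n_problems


def problems_moved(delta_points: float, n_problems: int) -> float:
    """Convert a change in pass rate (percentage points) into a problem count."""
    return delta_points / 100.0 * n_problems
```

On a 30-problem set one problem is worth about 3.3 points, so `problems_moved(3.4, 30)` comes out at roughly one.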
Two bugs hiding in the ruler
When a number won't move, suspect the ruler before the model. Several of my eight "failures" weren't the critic's fault at all.
The first was brutal. My build_full_code helper reassembled the model's revision by extracting the first {...} block — which silently truncated any revision containing nested braces. A correct fix with an inner loop or struct literal became a compile error before it ever reached go test. The critic was sometimes right and I was throwing the answer away.
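One plausible way this failure mode arises, and one way to avoid it, is sketched below. Both functions are hypothetical illustrations rather than the real `build_full_code`. The balanced version also has a known limit: it counts braces inside string literals, rune literals, and comments, so a robust extractor would need a real Go tokenizer.

```python
def first_brace_block_naive(text: str) -> str:
    """Stops at the first closing brace, so nested blocks get cut off."""
    start = text.index("{")
    end = text.index("}", start)
    return text[start : end + 1]


def first_brace_block_balanced(text: str) -> str:
    """Tracks nesting depth and returns the block that closes the first '{'."""
    start = text.index("{")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    raise ValueError("unbalanced braces")
```

Given a function body with an inner `for` loop, the naive version returns text with more opening than closing braces, which is exactly the kind of input that fails to compile before any test runs.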
The second was subtler and more interesting. The eval ran go test without first running goimports. So a revision that was logically perfect but left an unused import counted as a hard failure. Three of the eight held-out "failures" were pure import housekeeping — the kind any real pipeline fixes for free.
Run goimports and the base model's true score on that slice is 25/30, not 22/30. The eval had been undercounting the base by ten points on the full benchmark. I'd been trying to beat a baseline I'd accidentally handicapped.
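A minimal sketch of the corrected eval step, assuming `goimports` and `go` are on the PATH and that `workdir` already holds a Go module with the candidate file and its tests. The function names and the `./...` test target are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def eval_commands(source_file: str) -> List[List[str]]:
    """Commands to run, in order, for one candidate solution."""
    return [
        # goimports -w rewrites the file in place, adding missing imports
        # and removing unused ones.
        ["goimports", "-w", source_file],
        ["go", "test", "./..."],
    ]


def run_eval(workdir: Path, source_file: str, timeout_s: int = 60) -> bool:
    """Return True only if import cleanup and the tests both succeed."""
    for cmd in eval_commands(source_file):
        proc = subprocess.run(
            cmd, cwd=workdir, capture_output=True, text=True, timeout=timeout_s
        )
        if proc.returncode != 0:
            return False
    return True
```

Keeping the import cleanup inside the scoring path means the base and any critic are measured with the same ruler.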
With both bugs fixed, the honest picture got sharper, not better. True base: 83.3%. Real logic bugs among the failures: five. Critic fixes of those five: still zero.
Why repair is just solving
Here's the part I should have seen coming. Look at what the critic is actually asked to do. Given a failed attempt and an error, produce the correct code. For a genuine logic bug — not an unused import, an actual wrong algorithm — producing the correct code means solving the problem.
And the base fails those five problems precisely because it can't solve them. The critic is the same 160M base with two extra layers welded on. It shares the base's ceiling. It can learn the shape of a good reflection, the calm "the loop was missing a termination condition" prose, and then emit code that's still wrong, because knowing how to narrate a fix is not the same as knowing the fix.
To repair a solution you have to be able to solve the problem. A critic built on a model that couldn't solve it can't repair it either. It just learns to sound like it did.
This reframes the whole loop. The most a draft-execute-revise loop can do is give the base model more chances: more samples, conditioned on the failure. That's real, but it's bounded by what the base could reach on its own by resampling. The honest way to write that down is the standard pass@k estimator:
$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$
where $n$ is the number of samples drawn per problem and $c$ is how many of them are correct. If a problem has $c = 0$ correct samples in the base's whole output distribution, no amount of $k$ saves it, and no reflection layer trained on the same base changes $c$. The critic can reorder which solvable problems get solved first. It cannot manufacture a solution the base never had.
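For reference, the unbiased pass@k estimator from the Codex paper is 1 − C(n−c, k) / C(n, k) per problem, with n samples drawn and c of them correct, averaged over problems. A minimal sketch follows. The function names are mine, and the example inputs in the usage note are illustrative, not measured.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, i.e. every k-subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Mean pass@k over problems, given each problem's count of correct samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With `c = 0` the ratio is exactly 1, so `pass_at_k(n, 0, k)` is 0.0 for every valid `k`. That is the ceiling argument in one line: more samples only help when at least one correct sample exists.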
So I went looking for the ceiling
At this point the user steering the session said, reasonably: stop and think about how to push a 160M model to 90%. And then, the question that actually mattered — try an unseen benchmark.
Because here's the uncomfortable thing about that 83.3%. HumanEval-X Go was in the training data. A high pass@1 on a benchmark you trained on tells you the model memorised the benchmark. It does not tell you the model can write Go.
So I generated 25 fresh Go problems of comparable difficulty, never seen in training, verified that the canonical solutions compiled and passed, and ran the model.
- Leaked HumanEval-X: 83.3% (true base, held-out)
- Fresh problems: 0.0% (0 / 25, pass@1)
- Fresh, pass@10: 0.0% (0 / 25, even resampled)
Zero. Not low — zero, and zero again at pass@10 with execution filtering. The model that "scored 83%" could not solve a single novel problem of the same difficulty. It wasn't a weak coder. It was a lookup table that had memorised one specific benchmark and learned, in any real sense, nothing.
A leaked benchmark doesn't measure a small model's skill. It measures how completely it overfit. The gap between 83% leaked and 0% unseen is the memorisation, laid out as a single subtraction.
The last thing I tried was the obvious repair for that: continue fine-tuning on a few hundred diverse Go problems so the model learns to code rather than recite. I built 250 fresh problems and trained at a deliberately safe learning rate. The unseen score stayed at 0 of 25. And the memorised HumanEval-X slice collapsed from around 60% to 2.5%, one problem out of forty. The lookup table was brittle enough that nudging it toward generality shattered the one thing it was good at, without buying any generality back.
What I actually learned
The honest bottom line is not a tidy success, and I'd rather report it straight. A 160M critic can't beat its base, because repair is solving and it shares the ceiling. And this particular base wasn't really solving anything — it had memorised a benchmark. You can't bootstrap a model past what it can't already do, however clever the loop on top.
If I wanted the number to go up tomorrow, the cheapest honest win is already sitting there: run goimports and sample the base a few times. That reaches 83.3% on the leaked set with no critic at all, and it would almost certainly beat any 160M reflection layer. The loop was the interesting idea. The plumbing was the actual lever.
Where this goes next is the harder, slower path — a less overfit base trained on a genuinely large and diverse corpus, where the model has real headroom to repair into. The whole appeal of a self-improving loop is getting something for nearly nothing. This week was a clean reminder that the "nearly nothing" has to include a model that already knows how to do the thing. Watch this space.
Reading further
- Evaluating Large Language Models Trained on Code (Chen et al., 2021) — the Codex paper, and the source of the pass@k estimator above.
- Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023) — self-correction that works, and a useful contrast: it leans on a base strong enough that repair lands inside its ceiling.
- Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., 2023) — the careful negative result; intrinsic self-correction often doesn't help, for reasons that rhyme with what I hit here.