Backprop is just the chain rule
Training a neural network sounds mystical, but the engine underneath is one idea from first-year calculus: the chain rule, applied backwards through a computation graph and reusing its work. We trace a forward and backward pass through a tiny graph, see why we run it in reverse, and connect it to the downhill step that actually does the learning.
"The network learns" is one of those phrases that does a lot of quiet work. It makes training sound like something the model does, some emergent striving toward correctness. The reality is less mysterious and, to me, more impressive: the whole thing runs on a rule you met in your first calculus class. Backpropagation, the algorithm behind every trained neural network, is the chain rule. That's it. The cleverness is entirely in how it's arranged so that computing a billion derivatives costs about the same as computing the answer once.
Let's actually work through it, because the idea is small enough to hold in your hand and it demystifies an enormous amount.
The setup
A neural network is a giant composition of simple functions: multiply by some weights, add a bias, squash through a nonlinearity, repeat, and at the end compare the output to the truth with a loss. Training means nudging every weight in the direction that lowers that loss, which means we need, for every weight $w$, the partial derivative $\partial L / \partial w$: how much the loss $L$ moves when you wiggle that weight.
There can be billions of weights. So the question isn't "can we differentiate this" (the chain rule says yes), it's "can we get all those derivatives without doing billions of separate calculations". Backprop's answer is a tidy yes.
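To make "wiggle that weight" concrete, here is a minimal sketch of the brute-force alternative: estimating each partial derivative with a finite difference. The one-neuron model, the data point, and the names `loss` and `numerical_gradient` are all assumptions made up for illustration, not part of any library. The thing to notice is the cost: two extra loss evaluations per weight, which is exactly the bill backprop avoids.

```python
import math

def loss(weights, x=1.5, target=0.0):
    """Hypothetical one-neuron model: squared error of tanh(w0 * x + w1)."""
    w0, w1 = weights
    prediction = math.tanh(w0 * x + w1)
    return (prediction - target) ** 2

def numerical_gradient(f, weights, eps=1e-6):
    """Estimate dL/dw for each weight with a central difference."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps      # wiggle weight i up...
        down[i] -= eps    # ...and down, leaving the others fixed
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

weights = [0.8, -0.3]
grads = numerical_gradient(loss, weights)
print(grads)
```

With two weights this is fine. With a billion, it is two billion forward passes per training step.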
One forward pass, one backward pass
Forget the billion for a moment and take a toy. Here's a small expression drawn as a graph: inputs at the bottom, operations stacked above. Hit forward to compute the value flowing up; hit backward to compute the gradient of the output with respect to everything, flowing down.
Forward is ordinary arithmetic. Backward is the interesting half. We start at the top with the derivative of the output with respect to itself, which is 1, and push derivatives down, multiplying by each local derivative as we pass through it. That is the chain rule, one edge at a time: the gradient at a node is the gradient arriving from the node above, times the local derivative along the edge between them.
Now look at the input in the diagram that feeds two nodes. When you wiggle it, both paths to the output respond, so its gradient is the sum over both paths: one chain-rule product per path, added together.
Toggle to backward and you'll see exactly that: the two highlighted paths into that input, and its gradient being their sum. That "gradients add over paths" rule is the entire generalisation from single-variable to multi-variable calculus, and it's all backprop needs.
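Here is a minimal sketch of that forward and backward pass, written out by hand. The expression $e = (a + b)(b + 1)$ is an assumption borrowed from Olah's post (listed under Reading further) and may not be the one in the diagram; the variable names are illustrative too. What matters is that `b` feeds both intermediate nodes, so its gradient is a sum of two path contributions.

```python
# Forward pass: compute and remember every intermediate value.
a, b = 2.0, 1.0
c = a + b        # c = 3.0
d = b + 1.0      # d = 2.0
e = c * d        # e = 6.0

# Backward pass: start from de/de = 1 and push gradients down.
grad_e = 1.0
grad_c = grad_e * d          # local derivative of c * d with respect to c is d
grad_d = grad_e * c          # local derivative of c * d with respect to d is c
grad_a = grad_c * 1.0        # a reaches e only through c
# b reaches e through c and through d, so the two paths add.
grad_b = grad_c * 1.0 + grad_d * 1.0

print(e, grad_a, grad_b)     # 6.0 2.0 5.0
```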
The whole algorithm, honestly
Backpropagation is: do a forward pass and remember every intermediate value; then walk the graph backwards, and at each node multiply the gradient coming from above by the node's local derivative, summing where paths rejoin. The reason it's fast rather than just correct is reuse. Notice that the gradients at the upper nodes get used again further down: each is computed once and shared by every path that passes through it, not recomputed for every input. That single act of bookkeeping, caching the shared subresults, is the difference between a derivative you can afford and one you can't.
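The procedure above can be sketched as a tiny reverse-mode autodiff class, in the spirit of Karpathy's micrograd but not copied from it. The `Value` class and its attribute names are hypothetical, and it supports only addition and multiplication, which is enough for a toy graph such as the assumed expression $(a + b)(b + 1)$.

```python
class Value:
    """A scalar node that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), cached in the forward pass

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the nodes so each one is visited only after every node that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                # Chain rule on one edge; += sums gradients where paths rejoin.
                parent.grad += node.grad * local


a, b = Value(2.0), Value(1.0)
e = (a + b) * (b + 1.0)
e.backward()
print(e.data, a.grad, b.grad)   # 6.0 2.0 5.0
```

Each node's gradient is final before it is passed on, so nothing upstream is ever recomputed: the backward loop touches every edge exactly once.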
Why backwards, specifically
You could apply the chain rule in the other direction, pushing derivatives up from each input. That's "forward mode", and it works fine, with one fatal catch: each forward sweep gives you the derivatives of everything with respect to one input. With a billion inputs (weights) and one output (the loss), you'd need a billion sweeps.
Reverse mode flips it. One backward sweep gives you the derivative of one output with respect to everything. Since we have exactly one loss and a billion weights, that's precisely the direction we want: a single backward pass, and every gradient falls out at once.
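For contrast, here is a minimal sketch of forward mode using dual numbers, on an assumed toy expression $(a + b)(b + 1)$. The `Dual` class and the `derivative_wrt` helper are hypothetical names for this sketch. Each sweep seeds one input with derivative 1 and carries derivatives upward, so the full gradient takes one sweep per input.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule, applied as the value itself is computed.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def f(a, b):
    return (a + b) * (b + 1.0)


def derivative_wrt(index, inputs):
    """One forward sweep: seed a single input with derivative 1."""
    duals = [Dual(x, 1.0 if i == index else 0.0) for i, x in enumerate(inputs)]
    return f(*duals).deriv


inputs = [2.0, 1.0]
# One full sweep per input: two here, a billion for a billion weights.
gradient = [derivative_wrt(i, inputs) for i in range(len(inputs))]
print(gradient)   # [2.0, 5.0]
```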
The asymmetry that decides everything
Forward mode is cheap when you have few inputs and many outputs; reverse mode is cheap when you have many inputs and few outputs. Training a network is the extreme of the second case (millions to billions of parameters, one scalar loss), so reverse mode (backprop) wins by a factor of however many parameters you have. Reverse mode has a price, though: the forward pass has to stash its intermediate activations, because the backward pass needs them to compute the local derivatives. That is why training a big model eats so much more memory than just running it.
From gradient to learning
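Backprop only computes the gradient. The learning is the next, almost embarrassingly simple step: take a step downhill. For every parameter $w$,
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
where $\eta$ is the learning rate, a small positive step size.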
That's the loop: forward pass to get the loss, backward pass to get the gradient, one small step against it, repeat a few million times. The lab below is that downhill step in action on a loss surface. I wrote about the optimisers that steer it separately, but it's worth seeing here so the picture is complete: backprop tells you which way is down, and gradient descent walks.
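The loop can be sketched on a one-parameter problem. The quadratic loss $L(w) = (w - 3)^2$, the learning rate, and the step count are assumptions chosen for illustration; the gradient is written by hand here, standing in for what a backward pass would produce.

```python
def loss(w):
    return (w - 3.0) ** 2      # hypothetical loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # dL/dw, standing in for the backward pass

w = 0.0                        # initial parameter value
learning_rate = 0.1            # assumed step size

for step in range(100):
    g = grad(w)                # which way is uphill at the current w
    w = w - learning_rate * g  # one small step against the gradient

print(w, loss(w))
```

Each step shrinks the distance to the minimum by a constant factor (0.8 with these numbers), so `w` ends up very close to 3.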
Where it goes wrong
The chain rule multiplies. Multiply many numbers below one together and the product rushes to zero; multiply many above one and it blows up. Stack a deep network and the gradient reaching the early layers is a long product of factors, so it tends to either vanish (early layers stop learning) or explode (training diverges). This isn't a bug in backprop, it's backprop working correctly on a badly-conditioned graph.
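A few lines of arithmetic make the point. The per-layer factors 0.9 and 1.1 and the depth of 100 are assumptions for illustration; a real network has a different factor at every layer.

```python
depth = 100   # assumed number of layers the gradient passes through

def gradient_scale(factor, depth):
    """Product of `depth` identical local derivatives."""
    scale = 1.0
    for _ in range(depth):
        scale *= factor
    return scale

vanishing = gradient_scale(0.9, depth)   # roughly 2.7e-05
exploding = gradient_scale(1.1, depth)   # roughly 1.4e+04
print(f"{vanishing:.1e} {exploding:.1e}")
```

Factors only ten percent away from 1 leave the two results about nine orders of magnitude apart after a hundred layers.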
Most of the architectural furniture of modern networks is really about keeping that product well-behaved: normalisation layers keep the factors near a sane scale, careful initialisation starts them there, and the residual connections I went on about in the transformer post add a clean path so the gradient has an unbroken route home that doesn't get multiplied to death. Once you see training as a long product of derivatives, half the tricks in deep learning reveal themselves as "stop the product from misbehaving".
Some food for thought: the chain rule is about 300 years old, autodiff as an idea is from the 1960s, and backprop-for-networks was nailed down in 1986. None of the maths is new. What changed is that we got enough compute to run the backward pass over enough data, and the old calculus did the rest. It's a good reminder that the breakthrough isn't always a new idea; sometimes it's an old idea that finally became affordable.
Recap
Backprop is the chain rule, run backwards through the computation graph, caching shared subresults so the cost stays near that of a single forward pass. Run it once and you get every parameter's gradient; feed those to a downhill step and the network "learns". No striving, no magic, just derivatives and good bookkeeping, which I honestly find more impressive than magic would be.
Reading further
- Rumelhart, Hinton & Williams (1986), Learning representations by back-propagating errors: the paper that put backprop on the map for neural networks. Short and very readable. nature.com
- Olah, Calculus on Computational Graphs: the clearest visual explanation of forward vs reverse mode going, and the direct inspiration for the graph above. colah.github.io
- Karpathy, micrograd: backprop in about 100 lines of Python you can read in one sitting, building exactly the graph machinery above from scratch. github.com/karpathy/micrograd
- Baydin et al. (2018), Automatic Differentiation in Machine Learning: a Survey: the thorough reference once you want the full landscape of autodiff. arXiv:1502.05767