Learning by Rolling Downhill
Every neural network you've ever used was trained by the oldest trick in calculus: to minimise a function, walk downhill. The whole story of modern optimisers is a list of the specific ways plain downhill walking fails, and the patch for each.
The entire training run of a billion-parameter model, the thing that costs a data centre weeks and a small fortune in electricity, is built on a trick Augustin-Louis Cauchy wrote down in 1847. To make a function smaller, find the direction in which it falls most steeply, take a step that way, and repeat. That's it. Hadamard arrived at the same idea independently around 1907; Haskell Curry first studied when it actually converges in 1944. None of them had a neural network in mind. They just wanted to minimise things, and downhill is where the minimum is.
What's genuinely surprising is not that this works. It's how badly it works in its raw form, and how much of modern machine learning is a stack of patches for the specific ways naive downhill walking falls over. So let's start with the trick, then break it on purpose, because the failure modes are where the understanding lives.
The rule, and the one number that breaks it
Gradient descent is a single line. Let $\theta$ be the parameters, $L$ the loss, $\eta$ the learning rate.
$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t) \tag{1}$$
The learning rate is deceptively dangerous. Take a one-dimensional quadratic $L(\theta) = \tfrac{1}{2} a \theta^2$ with curvature $a$ (the second derivative). One step multiplies your distance from the minimum by $1 - \eta a$. So you converge only when $|1 - \eta a| < 1$, that is for $0 < \eta < 2/a$. Below $1/a$ you glide in monotonically; between $1/a$ and $2/a$ you converge but oscillate, ping-ponging across the minimum with shrinking amplitude; at exactly $2/a$ you orbit forever; above it you diverge. The same lever that sets your speed sets your stability, and the boundary is the curvature you usually don't know.
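The four regimes are easy to check numerically. This is a minimal sketch, assuming the quadratic is `0.5 * curvature * x**2` with a hypothetical curvature of 2, so the two thresholds sit at learning rates 0.5 and 1.0; the function name `gd_1d` and the chosen learning rates are illustrative only.

```python
def gd_1d(curvature, lr, x0=1.0, steps=50):
    """Gradient descent on L(x) = 0.5 * curvature * x**2; returns every iterate."""
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]        # dL/dx at the current point
        xs.append(xs[-1] - lr * grad)    # same as multiplying by (1 - lr * curvature)
    return xs


a = 2.0  # hypothetical curvature, so the thresholds are 1/a = 0.5 and 2/a = 1.0

monotone = gd_1d(a, lr=0.25)     # factor +0.5: shrinks, never changes sign
oscillating = gd_1d(a, lr=0.75)  # factor -0.5: shrinks, sign flips every step
orbiting = gd_1d(a, lr=1.0)      # factor -1.0: bounces between +1 and -1 forever
diverging = gd_1d(a, lr=1.25)    # factor -1.5: grows without bound

for name, path in [("monotone", monotone), ("oscillating", oscillating),
                   ("orbiting", orbiting), ("diverging", diverging)]:
    print(f"{name:12s} first steps {path[:4]}  |x_50| = {abs(path[-1]):.3g}")
```

Nothing about the loss changes between the four runs. Only the learning rate moves, and it alone decides which regime you land in.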
One learning rate, many curvatures
Here is the catch that makes tuning hard. Real loss surfaces curve differently in different directions, and you get one learning rate for all of them. Take the bowl in the lab, an axis-aligned quadratic of the form $f(x, y) = \tfrac{1}{2}(a x^2 + b y^2)$. Its gradient is $(a x,\, b y)$, so the curvature is $a$ along $x$ and $b$ along $y$. Say $y$ is the steeper axis ($b > a$): the $y$ direction diverges once $\eta > 2/b$, while $x$ would happily take up to $2/a$. Your safe learning rate is dictated by the steepest direction, which means the gentle directions crawl. That gap between the tightest and loosest curvature, the ratio $b/a$, is the condition number, and it is the villain of the next two sections.
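To see the crawl in numbers, here is a small sketch on a stand-in bowl. The curvatures 1 and 100 are hypothetical (they are not the lab's values), as is the helper name `gd_bowl`; the point is only that the steep axis caps the learning rate while the gentle axis pays for it.

```python
def gd_bowl(curvatures, lr, start, steps):
    """Gradient descent on f(p) = 0.5 * sum(c_i * p_i**2); returns the final point."""
    point = list(start)
    for _ in range(steps):
        # The gradient of this separable bowl is c_i * p_i in each coordinate.
        point = [p - lr * c * p for p, c in zip(point, curvatures)]
    return point


curvatures = (1.0, 100.0)  # hypothetical: gentle along x, steep along y
condition_number = max(curvatures) / min(curvatures)

# The steep axis is stable only below lr = 2 / 100 = 0.02.
safe = gd_bowl(curvatures, lr=0.019, start=(1.0, 1.0), steps=100)
unsafe = gd_bowl(curvatures, lr=0.021, start=(1.0, 1.0), steps=100)

print("condition number:", condition_number)
print("lr=0.019 ->", safe)     # y has converged, x is still far from 0
print("lr=0.021 ->", unsafe)   # y has blown up
```

At the safe rate the steep coordinate is essentially done after 100 steps, while the gentle one has covered only most of the way to the minimum. Nudging the rate just past the steep axis's limit fixes nothing for the gentle axis and blows up the steep one.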
Failure mode one: the ravine
Switch the lab to the Ravine preset, a scaled Rosenbrock function with a long, curved, narrow valley. (The textbook Rosenbrock uses a steepness constant of 100; the lab dials it back to 20 so the traces stay on screen, but the pathology is the same.) This is the case that humbles plain gradient descent.
Watch the orange trace, plain gradient descent. The valley walls are steep and its floor is nearly flat, so the gradient points mostly across the valley, not along it. Each step throws the dot at the opposite wall; it zig-zags wall to wall and barely inches forward. All the energy goes sideways, where you don't want to move, and almost none goes down the channel, where you do.
The fix, the yellow trace, is momentum. Instead of stepping on the current gradient, accumulate a running, discounted sum of past gradients, a heavy ball that keeps its heading.
$$m_t = \beta\, m_{t-1} + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, m_t \tag{2}$$
The across-valley components flip sign every step, so they cancel in the running sum. The along-valley component points the same way every step, so it builds. A heavy ball ignores the rattling walls and rolls down the floor. The decay factor $\beta$ (typically $0.9$) sets how much of the past the ball remembers.
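A rough way to see the heavy ball at work without the lab is a straight quadratic valley. This sketch is not the lab's Rosenbrock ravine: the curvatures (1 along the floor, 100 across it), the learning rate, and the function names are all assumptions chosen for illustration. Setting `beta=0` recovers plain gradient descent, so both traces come from the same loop.

```python
import math

CURVATURES = (1.0, 100.0)  # hypothetical: flat valley floor along x, steep walls along y


def gradient(point):
    """Gradient of f(x, y) = 0.5 * (1 * x**2 + 100 * y**2)."""
    return [c * p for c, p in zip(CURVATURES, point)]


def run(lr, beta, start=(1.0, 1.0), steps=100):
    """Heavy-ball descent; beta = 0 reduces to plain gradient descent."""
    point = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        grad = gradient(point)
        # Discounted running sum of past gradients.
        velocity = [beta * v + g for v, g in zip(velocity, grad)]
        point = [p - lr * v for p, v in zip(point, velocity)]
    return point


def distance_to_minimum(point):
    return math.hypot(point[0], point[1])


plain = run(lr=0.019, beta=0.0)
heavy_ball = run(lr=0.019, beta=0.9)

print("plain GD   :", plain, distance_to_minimum(plain))
print("heavy ball :", heavy_ball, distance_to_minimum(heavy_ball))
```

In this toy the gain is entirely along the floor: the accumulated velocity carries the ball down the gentle direction far faster than the raw gradient would. The steep direction is already handled by the learning rate here, and momentum does not speed it up.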
Failure mode two: the saddle
The folk story of optimisation says the danger is getting stuck in a bad local minimum. In high dimensions that story is mostly wrong, and the Saddle preset shows why.
A saddle is a point that's a minimum along some directions and a maximum along others. The gradient there is nearly zero, so equation (1) takes vanishing steps and the optimiser can sit on the ridge as if it had arrived, when really it just needs to find the one direction that goes down.
In high dimensions, saddles are the enemy, not minima
Why care so much about saddles? Because of a counting argument. A true local minimum needs every curvature direction to point up; a saddle needs only one to point down. With a handful of parameters that's a mild distinction. With millions of parameters, having all of them curve upward at once is astronomically unlikely, so almost every point where the gradient vanishes is a saddle, not a minimum. Dauphin and colleagues made this precise in 2014: in high-dimensional non-convex problems, critical points are exponentially more likely to be saddles. The thing we feared, a bad minimum, is rare. The thing we ignored, an endless plateau around a saddle, is everywhere.
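For a feel of the scale, here is a deliberately crude back-of-envelope model, which is not the argument Dauphin and colleagues actually make. Assume, hypothetically, that at a critical point each of the $n$ curvature directions is independently and equally likely to point up or down. Then:

```latex
P(\text{local minimum}) = \left(\tfrac{1}{2}\right)^{n},
\qquad
P(\text{saddle or maximum}) = 1 - \left(\tfrac{1}{2}\right)^{n}.
```

For $n = 10$ that is about one in a thousand; for $n = 10^{6}$ it is roughly $10^{-301030}$. Real Hessian eigenvalues are neither independent nor fair coins, so read this as intuition for the exponent, not as a measured probability.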
Adapting the step per direction
Momentum fixes direction; the next idea fixes scale. Instead of one global learning rate, give every parameter its own, shrinking the step for directions whose gradients have been large and growing it for directions that have been quiet. RMSProp keeps a running average of each squared gradient and divides by its root. Writing $g_t = \nabla L(\theta_t)$, with decay rate $\rho$ and a small $\epsilon$ to avoid dividing by zero:
$$v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t \tag{3}$$
The $g_t^2$ is an elementwise square; the division is elementwise too. A direction that keeps producing big gradients (the steep valley wall) gets a big $v$ and a damped step. A direction with small gradients (the valley floor) gets a small $v$ and a relatively larger step. It directly attacks the condition-number problem from the first callout.
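The equalising effect shows up in a single update. Below is a minimal sketch of one RMSProp step in plain Python. The function name `rmsprop_step`, the default hyperparameters, and the 1-versus-100 bowl are assumptions for illustration, not any particular library's API.

```python
import math


def rmsprop_step(params, grads, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update; returns (new_params, new_avg_sq)."""
    # Running average of the elementwise squared gradient.
    new_avg_sq = [rho * s + (1 - rho) * g * g for s, g in zip(avg_sq, grads)]
    # Elementwise division by the root of that average.
    new_params = [p - lr * g / (math.sqrt(s) + eps)
                  for p, g, s in zip(params, grads, new_avg_sq)]
    return new_params, new_avg_sq


params = [1.0, 1.0]
grads = [1.0 * params[0], 100.0 * params[1]]  # hypothetical bowl: 100x steeper along y
params_after, avg_sq = rmsprop_step(params, grads, avg_sq=[0.0, 0.0])

raw_ratio = grads[1] / grads[0]          # how lopsided the raw gradient is
step_x = params[0] - params_after[0]     # distance actually moved along x
step_y = params[1] - params_after[1]     # distance actually moved along y

print("raw gradient ratio:", raw_ratio)
print("step along x:", step_x, " step along y:", step_y)
```

The raw gradient is a hundred times larger along the steep axis, yet both coordinates move by essentially the same amount. That first step comes out as `lr / sqrt(1 - rho)` rather than `lr`, because the running average starts at zero.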
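Adam is the one everybody reaches for, and it is just momentum and RMSProp bolted together, with one extra correction. With $g_t = \nabla L(\theta_t)$ and step counter $t$ starting at $1$:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \quad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \quad \theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{4}$$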
What the bias correction is actually fixing
The moment estimates $m$ and $v$ both start at zero, so early on they read too low. The popular intuition is that this makes the first steps tiny and Adam "stalls on the launch pad". That's backwards, and it's worth getting right. The update is the ratio $m / \sqrt{v}$, and since both $m$ and $v$ are biased toward zero, the biases largely cancel. Run the numbers at the first step with $\beta_1 = 0.9$, $\beta_2 = 0.999$: the $m$ correction multiplies by $1/(1 - \beta_1) = 10$, the $v$ correction by $1/(1 - \beta_2) = 1000$, and the square root turns that into $\sqrt{1000} \approx 31.6$. Drop the correction entirely and the first step is about $3.2$ times too big, not too small. The real job of the $1 - \beta^t$ factors is to fix the $\beta_1 \neq \beta_2$ mismatch (the slower-decaying $v$ bias) and bring the early steps back to roughly the intended size $\eta$.
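The arithmetic can be checked directly. This is a scalar sketch of one Adam update with the bias correction as a switch. The function name `adam_step`, the `bias_correction` flag, and the gradient value 0.5 are hypothetical; the defaults follow the values commonly quoted from Kingma and Ba.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=True):
    """One scalar Adam update at step t (1-indexed); returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: RMSProp-style average
    m_hat, v_hat = m, v
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


grad = 0.5  # hypothetical gradient; the ratio below is the same for any nonzero value, up to eps
corrected, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1)
uncorrected, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1, bias_correction=False)

step_with = abs(corrected)        # close to lr
step_without = abs(uncorrected)   # close to lr * sqrt(1000) / 10
ratio = step_without / step_with

print("first step with correction   :", step_with)
print("first step without correction:", step_without)
print("ratio:", ratio)
```

With the correction the first step lands at the learning rate; without it the step is larger by a factor of about the square root of ten, which is the mismatch between the two decay rates showing through.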
On the ravine above, that's why the purple Adam trace slides almost straight down the channel while orange thrashes: per-direction scaling plus momentum is exactly the combination a narrow valley punishes you for lacking.
Why the noise is a feature
There's a piece of sleight of hand in the name. We say "gradient descent" but in practice train with stochastic gradient descent, estimating the gradient from a small random batch rather than the whole dataset. The estimate is noisy, and for years that noise was treated as a regrettable cost of not affording the full gradient. It turns out to be doing real work.
A full-batch optimiser computes the exact gradient, and at a saddle the exact gradient is genuinely tiny, so it can sit on the ridge indefinitely, fooled into thinking a mountain pass is a summit. The noisy estimate from a mini-batch is almost never exactly zero, so it jostles the optimiser off the ridge and back onto a downhill direction. The randomness we apologise for is what rescues us from the plateaus of the previous section. For this to converge you want the step sizes $\eta_t$ to satisfy the Robbins-Monro conditions, $\sum_t \eta_t = \infty$ (you can still travel any distance) and $\sum_t \eta_t^2 < \infty$ (the noise eventually gets damped), which is the formal reason learning-rate schedules decay over training.
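The ridge-sitting behaviour can be reproduced on the simplest possible saddle. This is a sketch under stated assumptions: the surface `x**2 - y**2` is a toy with no minimum at all, Gaussian noise added to the gradient stands in for mini-batch noise, and the name `descend_saddle`, the noise level, and the escape threshold are all hypothetical.

```python
import random


def descend_saddle(noise_std, lr=0.1, start=(1.0, 0.0), max_steps=200, seed=0):
    """Gradient descent on f(x, y) = x**2 - y**2, a saddle at the origin.

    Gaussian noise with the given standard deviation is added to each gradient
    component. Stops early once |y| > 1, which here counts as having left the
    ridge. Returns (x, y, steps_taken).
    """
    rng = random.Random(seed)
    x, y = start
    for step in range(1, max_steps + 1):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        if noise_std > 0:
            grad_x += rng.gauss(0.0, noise_std)
            grad_y += rng.gauss(0.0, noise_std)
        x, y = x - lr * grad_x, y - lr * grad_y
        if abs(y) > 1.0:
            return x, y, step
    return x, y, max_steps


exact = descend_saddle(noise_std=0.0)    # starts exactly on the ridge y = 0
noisy = descend_saddle(noise_std=0.01)   # same start, slightly noisy gradients

print("exact gradient:", exact)   # slides to the saddle point and stays there
print("noisy gradient:", noisy)   # gets nudged off the ridge and runs downhill
```

With the exact gradient the downhill coordinate never receives a single nonzero push, so the run converges to the saddle point and sits there. A small amount of noise is enough to seed that coordinate, and the negative curvature amplifies it from there.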
The bug that turned out to be a feature
Stochastic noise does two good things at once. It knocks the optimiser off saddles and flat ridges, and it is widely believed to bias training toward flatter minima, which tend to generalise better to unseen data than sharp ones. The flat-minima story is the mainstream view rather than a settled theorem (there are honest counterexamples in the literature), so hold it loosely. But the broad shape is real: a cheaper, noisier gradient often trains a better model than the expensive exact one. That is not the trade-off anyone expected.
So is Adam simply the answer? Not quite, and this is the kind of cost-aware caveat worth keeping. Wilson and colleagues showed in 2017 that adaptive methods like Adam can converge to solutions that generalise slightly worse than well-tuned plain SGD with momentum, which is why a fair amount of computer-vision work still finishes training on plain SGD. It's a tendency, not a law: most large language models train end to end on Adam and its weight-decay variant AdamW perfectly happily. The honest summary is that the per-direction rescaling that makes Adam so robust to tune is the same thing that occasionally costs it a little at the very end.
Why not just use the curvature?
A reasonable objection: if the trouble is curvature, why not measure it? Newton's method does exactly that, multiplying the gradient by the inverse Hessian $H^{-1}$ (the Hessian $H$ being the matrix of second derivatives).
$$\theta_{t+1} = \theta_t - H^{-1}\, \nabla L(\theta_t) \tag{5}$$
On a convex quadratic this converges in one step; it sees the bowl's exact shape and jumps to the bottom. The reason deep learning doesn't use it is brutal arithmetic. For $n$ parameters the Hessian has $n^2$ entries to store and costs on the order of $n^3$ to solve. At $n$ in the billions, $n^2$ is already impossible. So we settle for first-order methods that only ever touch the gradient, and we spend our cleverness (momentum, per-direction scaling, bias correction) approximating what Newton would have told us for free if only we could afford it.
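Both halves of that claim fit in a few lines. This sketch assumes a hypothetical two-parameter convex quadratic with a hand-picked Hessian, solved with the closed-form 2x2 inverse, followed by the storage arithmetic for a billion parameters at four bytes per entry. The names and numbers are illustrative only.

```python
def gradient_2d(point, hessian, linear):
    """Gradient of f(p) = 0.5 * p.H.p - linear.p, which is H p - linear."""
    (h00, h01), (h10, h11) = hessian
    return (h00 * point[0] + h01 * point[1] - linear[0],
            h10 * point[0] + h11 * point[1] - linear[1])


def newton_step_2d(point, hessian, linear):
    """One Newton step: point - H^{-1} * gradient, using the 2x2 inverse."""
    (h00, h01), (h10, h11) = hessian
    g0, g1 = gradient_2d(point, hessian, linear)
    det = h00 * h11 - h01 * h10
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (-h10 * g0 + h00 * g1) / det
    return (point[0] - d0, point[1] - d1)


hessian = ((3.0, 1.0), (1.0, 2.0))   # hypothetical symmetric positive-definite Hessian
linear = (1.0, 1.0)
start = (5.0, -4.0)

after_one_step = newton_step_2d(start, hessian, linear)
residual = gradient_2d(after_one_step, hessian, linear)
print("after one Newton step:", after_one_step, "gradient:", residual)

# Why this does not scale: storing a dense Hessian for a billion parameters.
n = 10 ** 9
hessian_bytes = n * n * 4            # float32 entries
print("Hessian storage:", hessian_bytes / 1e18, "exabytes")
```

One step from an arbitrary start lands on the minimiser, with the gradient vanishing to rounding error. The same idea at a billion parameters needs about four exabytes just to hold the matrix, before any solve.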
There's a nice unification hiding here, and it's the kind of cross-field rhyme this site keeps running into. Minimising $L$ means finding where $\nabla L = 0$, which is a root-finding problem on the gradient. Newton's method for optimisation is just Newton's method for root-finding applied to $\nabla L$, and stochastic gradient descent is a stochastic root-finder, the original setting of that 1951 Robbins-Monro paper. Seen from the right angle, SGD and Newton's method are the same algorithm wearing different budgets. Some food for thought next time someone calls one "first-order" and the other "second-order" as if they were unrelated.
The loss surface, by the way, is a dynamical system: run the update and the parameters trace a flow, and the fixed points, minima and saddles, are exactly the equilibria we classify in the phase portraits post. Everything that learns on this site, the attention layers, the zoo of architectures, the KV-cache tricks, got that way by rolling downhill on a surface like the ones in the lab above. Set the preset to Bumpy, drop the learning rate, and watch the optimisers pick their way past the little traps. It's the same 1847 idea, still doing the work.
Reading further
- Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées (1847). Comptes Rendus 25, 536-538. Two and a bit pages; the origin of steepest descent.
- Robbins & Monro, A Stochastic Approximation Method (1951). Annals of Mathematical Statistics 22, 400-407. The ancestor of SGD, and the source of the step-size conditions above.
- Sutskever et al., On the importance of initialization and momentum in deep learning (2013). ICML. The paper that showed momentum, done right, matters for deep nets.
- Kingma & Ba, Adam: A Method for Stochastic Optimization (2015). ICLR. The optimiser you actually use, equation (4) and all.
- Dauphin et al., Identifying and attacking the saddle point problem (2014). NeurIPS. The argument that saddles, not local minima, dominate high-dimensional loss surfaces.
- Wilson et al., The Marginal Value of Adaptive Gradient Methods (2017). NeurIPS. The case that well-tuned SGD can still beat Adam at generalisation.