Gradient descent, by hand.

Every neural network you have ever heard of was trained by walking downhill on a loss surface. The downhill direction is the negative gradient. The step length is the learning rate times the size of the gradient. The shape of the surface decides whether that recipe is gentle or violent.

Below, four canonical 2D surfaces and four canonical optimizers. Click anywhere to drop a marble. Slide the learning rate up until things get violent.

A bowl, a marble, a learning rate.

Click anywhere on the surface to drop a starting point. The marble takes one gradient step at a time toward the lowest place it can see. Push the learning rate too high and it overshoots; push it too low and the run takes forever.
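A minimal sketch of that loop, assuming the bowl is f(x, y) = x^2 + y^2 (the explorable's actual surface is not given here). The names `bowl_grad`, `descend`, and `distance_to_origin` are illustrative, not taken from the page. On this particular bowl each step multiplies both coordinates by (1 - 2 * lr), so any learning rate above 1 lands every step farther from the minimum than the one before.

```python
def bowl_grad(x, y):
    """Gradient of the assumed bowl f(x, y) = x**2 + y**2."""
    return 2.0 * x, 2.0 * y


def descend(grad, start, lr, steps=200):
    """Plain gradient descent; returns every point the marble visits."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Step against the gradient, scaled by the learning rate.
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


def distance_to_origin(point):
    """Distance from the bowl's minimum at (0, 0)."""
    return (point[0] ** 2 + point[1] ** 2) ** 0.5


if __name__ == "__main__":
    # Too low, comfortable, overshooting but converging, diverging.
    for lr in (0.001, 0.1, 0.9, 1.1):
        end = descend(bowl_grad, start=(2.0, 1.0), lr=lr)[-1]
        print(f"lr={lr:<5} distance after 200 steps: {distance_to_origin(end):.3g}")
```

The four learning rates are arbitrary picks meant to land in the four regimes: barely moving, converging, overshooting on every step yet still converging, and blowing up.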

Loss surface

A long thin valley. Plain SGD ricochets between the walls; momentum and adaptive methods slide along the floor.

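Three things to look for as you play. The ravine is the canonical exhibit: along the long axis the gradient is tiny, so SGD inches forward; across the short axis it’s huge, so SGD bounces between the walls. Momentum and Adam both fix this, but for different reasons. Momentum averages successive gradients, so the cross-valley components, which keep flipping sign, cancel while the along-valley components add up. Adam divides each coordinate’s step by a running estimate of that coordinate’s gradient size, so the steep axis is damped and the shallow one is boosted.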
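The ravine behaviour can be sketched in a few lines, assuming a quadratic valley f(x, y) = 0.5 * (x^2 + 100 * y^2) and textbook forms of the three update rules. The surface, the hyperparameters, and the function names are all illustrative assumptions, not the explorable's own. With these numbers SGD's cross-valley coordinate flips sign on every step while its along-valley coordinate shrinks by under 2% per step.

```python
import math

A, B = 1.0, 100.0  # assumed curvatures: shallow along x, steep along y


def ravine_loss(p):
    x, y = p
    return 0.5 * (A * x * x + B * y * y)


def ravine_grad(p):
    x, y = p
    return (A * x, B * y)


def run_sgd(start, lr, steps):
    p = tuple(start)
    path = [p]
    for _ in range(steps):
        g = ravine_grad(p)
        p = tuple(pi - lr * gi for pi, gi in zip(p, g))
        path.append(p)
    return path


def run_momentum(start, lr, steps, beta=0.9):
    p, vel = tuple(start), (0.0, 0.0)
    path = [p]
    for _ in range(steps):
        g = ravine_grad(p)
        # Velocity is a decaying sum of past gradients: components that
        # keep changing sign cancel, consistent ones accumulate.
        vel = tuple(beta * vi - lr * gi for vi, gi in zip(vel, g))
        p = tuple(pi + vi for pi, vi in zip(p, vel))
        path.append(p)
    return path


def run_adam(start, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    p = tuple(start)
    m, v = (0.0, 0.0), (0.0, 0.0)
    path = [p]
    for t in range(1, steps + 1):
        g = ravine_grad(p)
        m = tuple(beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g))
        v = tuple(beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g))
        # Bias correction for the zero-initialised running averages.
        m_hat = tuple(mi / (1 - beta1 ** t) for mi in m)
        v_hat = tuple(vi / (1 - beta2 ** t) for vi in v)
        # Each coordinate is divided by its own gradient scale.
        p = tuple(pi - lr * mh / (math.sqrt(vh) + eps)
                  for pi, mh, vh in zip(p, m_hat, v_hat))
        path.append(p)
    return path


if __name__ == "__main__":
    start, steps = (-4.0, 1.0), 100
    runs = {
        "sgd": run_sgd(start, lr=0.019, steps=steps),
        "momentum": run_momentum(start, lr=0.019, steps=steps),
        "adam": run_adam(start, lr=0.1, steps=steps),
    }
    for name, path in runs.items():
        print(f"{name:<9} final loss {ravine_loss(path[-1]):.4g}")
```

Adam's very first step is a useful check on the per-coordinate scaling: the gradient is 25 times larger across the valley than along it at this start point, yet both coordinates move by almost exactly the learning rate.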

On the saddle, watch what happens when you start exactly on the y-axis. The x-gradient is zero there, so SGD never moves sideways: it slides along the axis into the saddle point and sits there forever. In a real high-dimensional loss surface, saddles are everywhere, and noisy gradients are usually what nudges you off them.
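A small sketch of the stuck marble, assuming the saddle is f(x, y) = y^2 - x^2, so that x is the escape direction (the explorable's exact surface is not stated). `saddle_grad` and `descend_saddle` are hypothetical names, and the Gaussian noise is only a stand-in for the minibatch noise of real SGD.

```python
import random


def saddle_grad(x, y):
    """Gradient of the assumed saddle f(x, y) = y**2 - x**2."""
    return -2.0 * x, 2.0 * y


def descend_saddle(start, lr=0.1, steps=200, noise=0.0, seed=0):
    """Gradient descent with optional Gaussian noise added to the gradient."""
    rng = random.Random(seed)
    x, y = start
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        if noise:
            gx += rng.gauss(0.0, noise)
            gy += rng.gauss(0.0, noise)
        x, y = x - lr * gx, y - lr * gy
    return x, y


if __name__ == "__main__":
    print("on the axis:   ", descend_saddle((0.0, 1.0)))
    print("nudged off it: ", descend_saddle((1e-6, 1.0)))
    print("noisy gradient:", descend_saddle((0.0, 1.0), noise=1e-3))
```

Started exactly on the axis, x never leaves zero and y decays toward the saddle point. A nudge of one millionth, or a little gradient noise, is enough for x to grow geometrically and carry the marble away.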

The Rosenbrock banana is famous because the path to its minimum follows a curving valley floor between steep walls. First-order methods all struggle with it. Methods that use curvature information (which this explorable doesn’t cover) cope far better, which is why people occasionally still bring out quasi-Newton optimizers such as L-BFGS.
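To see the struggle in numbers, here is a sketch of plain gradient descent on the standard Rosenbrock function f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, whose minimum is at (1, 1). The start point (-1.2, 1), the learning rate, and the step counts are assumptions chosen for illustration, and the function names are hypothetical.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x * x) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x * x)
    dy = 2.0 * b * (y - x * x)
    return dx, dy


def descend_banana(start, lr=1e-3, steps=200):
    """Plain gradient descent on the Rosenbrock function."""
    x, y = start
    for _ in range(steps):
        gx, gy = rosenbrock_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


if __name__ == "__main__":
    start = (-1.2, 1.0)
    print(f"start      loss {rosenbrock(*start):.4g}")
    for steps in (200, 2000):
        x, y = descend_banana(start, steps=steps)
        print(f"{steps:>5} steps at ({x:.3f}, {y:.3f}), loss {rosenbrock(x, y):.4g}")
```

The first few steps drop the marble onto the valley floor and the loss falls quickly. After that the steep walls cap the usable learning rate, so progress along the gently sloping floor is slow.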

Real loss landscapes live in millions of dimensions, not two. What survives the dimensional jump is the intuition: small steps in the steepest direction, occasionally with memory, occasionally with per-dimension scaling.