Training a Neural Network — Rolling Downhill, Blindfolded

A neural network learns by feeling the slope under its feet and taking a tiny step downhill. Do that a billion times and you have a trained model. One hill, no calculus.

Imagine you're dropped somewhere on a vast, crumpled landscape of hills and valleys. It's pitch dark. You can't see more than a foot in any direction. Your job is simple: get to the lowest point you can find. No map. No compass. Just you, the ground, and whatever direction gravity seems to be pulling your body.

What do you actually do?

You feel the tilt of the ground under your feet. You take one small step in whichever direction felt most downward. Then you feel the tilt again. Then you take another step. Then again. And again. And again.

You might not reach the absolute lowest point. You might end up in some local dip that isn't the deepest. But if you keep doing this, hundreds or thousands or millions of times, you will end up in a pretty low spot — a lot lower than where you started. Not by knowing where the bottom is. Just by always taking the next step in whichever direction looks downhill right here, right now.

Congratulations. You've just trained a neural network. No calculus. No code. One hill.


The landscape is real, sort of

Let's make the metaphor a little more concrete before we take it seriously. When we say a network is training, what's actually happening?

Recall from the last post that a neural network is a stack of layers, and each layer is a row of tiny voting committees, and each committee has a bunch of weights attached to its inputs. Those weights start as random numbers — meaningless noise. The whole job of training is to turn those random numbers into good numbers that make the network produce the right answers.

"Good" is measured by the loss: a single number that answers the question "how wrong was the network on this training example?" You met the loss in M2.1: it's the grader. Every time the network takes a guess, the loss spits out a number saying how far off it was.

Here's the key mental move. Imagine every possible setting of every weight in the network laid out as a giant coordinate grid. Each point on the grid is one specific choice of all those weights. Now, at every point on that grid, there's an altitude: the loss the network would get with those weights, averaged over all the training examples. Good settings of the weights have low altitude. Bad settings have high altitude. The whole grid, with altitudes attached, is a loss landscape — a crumpled surface of hills and valleys in a space with millions or billions of dimensions.
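If you'd like to see a landscape with your own eyes, here is a minimal Python sketch. It assumes the smallest possible "network": a single weight w that predicts w * x, graded by squared error. The three training examples and the function name loss are made up for illustration; a real landscape has millions of weight axes, not one.

```python
# Toy training set (made up for illustration): the "right answer" is y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def loss(w):
    """Altitude at weight setting w: squared error averaged over all examples."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)

# Survey the landscape: one altitude per candidate weight setting.
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w = {w:.1f}  altitude (loss) = {loss(w):.2f}")
```

Run it and the altitude bottoms out at w = 2, the one setting that gets every example right, and rises on both sides. That dip is the valley the walker is trying to find.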

Training is the walk on that landscape. You start at a random point, high up somewhere in the hills, and you take tiny downhill steps until you reach a low-loss valley. When you're there, the weights are good, the network does its job, and you ship it.

Here's the honest one-liner to carry:

Training a neural network is walking downhill on a loss landscape — starting from a random point and taking tiny steps in whichever direction makes the loss a little smaller.

That's the whole story. Everything else is just "how do you tell which way is downhill?" and "how big should your steps be?"


Feeling the slope: the gradient in plain English

So, how does the network know which way is downhill? It can't see the landscape. It only has one number at a time: the loss on this batch of training examples, with the current weights.

The trick is called a gradient, and despite the name it's not scary. A gradient is just a list that says, for every weight in the network: if I nudge this weight a tiny bit up, does the loss go up or down, and by how much?

Think of it like this. You're standing on the hillside, blindfolded. You stick one foot out in each of several directions, feel whether each of those directions slopes up or down, and how steeply. The gradient is that whole report: for every possible direction of small movement, it tells you the local slope. Armed with that report, you know exactly which combined direction is the steepest downhill — and that's the direction you take your step.
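The "stick a foot out in each direction" report can be sketched directly. The Python below is an illustration only, on a made-up two-weight bowl; the names loss and gradient_by_nudging are hypothetical, and real networks do not compute gradients this way, because nudging every weight separately is far too slow.

```python
def loss(weights):
    """A made-up two-weight landscape: a bowl with its bottom at (3, -1)."""
    w1, w2 = weights
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def gradient_by_nudging(loss_fn, weights, nudge=1e-6):
    """For each weight: nudge it up a tiny bit and report how the loss changes."""
    base = loss_fn(weights)
    report = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += nudge
        report.append((loss_fn(nudged) - base) / nudge)
    return report

here = [0.0, 0.0]
slope = gradient_by_nudging(loss, here)
print(slope)  # roughly [-6.0, 4.0]
```

A negative entry means nudging that weight up makes the loss smaller; a positive entry means nudging it up makes the loss bigger. The steepest downhill direction is the report with every sign flipped.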

For a modern neural network, the gradient is a very long list: one number for each weight, which means anywhere from millions to hundreds of billions of numbers. Computing it sounds impossible, but there's a beautiful algorithm for doing it efficiently. It's called backpropagation, or just backprop. The name is intimidating; the idea is simple. Run the network forward to get a loss. Then walk backwards through the layers, using high-school-level arithmetic at each step, to figure out how much each weight contributed to that loss. The contributions for every weight in every layer fall out in one backwards pass through the network.

You don't need to remember how backprop works in detail. You need to remember two things:

  1. Backprop is how the network tells, for every weight, which direction makes the loss a little smaller.
  2. It does this in one efficient pass through the network, not by nudging weights one at a time. That efficiency is a big part of why deep learning is computationally feasible at all.

That's it. Backprop is the "feel the slope under every one of your feet at once" step. The gradient is the report. Gradient descent is the little downhill step that follows.
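For the curious, here is what one forward pass and one backwards pass can look like on a deliberately tiny network. This is a sketch under assumptions, not how any particular library does it: the network has just two weights (w1 and w2) with a tanh "squish" between them, and the function names forward, backward, and nudge_check are invented for this example. The nudge-based check is only there to confirm that the backwards pass gives the same answer as nudging weights one at a time.

```python
import math

def forward(w1, w2, x, y):
    """Run the tiny network forward and grade its guess."""
    h = w1 * x             # layer one: weighted input
    a = math.tanh(h)       # the squish
    out = w2 * a           # layer two
    loss = (out - y) ** 2  # squared error against the right answer y
    return h, a, out, loss

def backward(w1, w2, x, y):
    """One backwards pass: chain the local slopes from the loss back to each weight."""
    h, a, out, loss = forward(w1, w2, x, y)
    d_out = 2.0 * (out - y)    # how the loss changes with the output
    d_w2 = d_out * a           # because out = w2 * a
    d_a = d_out * w2           # pass the slope back through layer two
    d_h = d_a * (1.0 - a * a)  # slope of tanh is 1 - tanh^2
    d_w1 = d_h * x             # because h = w1 * x
    return d_w1, d_w2

def nudge_check(w1, w2, x, y, nudge=1e-6):
    """Slow cross-check: nudge each weight separately and measure the loss change."""
    base = forward(w1, w2, x, y)[3]
    d_w1 = (forward(w1 + nudge, w2, x, y)[3] - base) / nudge
    d_w2 = (forward(w1, w2 + nudge, x, y)[3] - base) / nudge
    return d_w1, d_w2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
print("backprop:", backward(w1, w2, x, y))
print("nudging: ", nudge_check(w1, w2, x, y))
```

The two printed pairs should agree to several decimal places. The difference is cost: the nudging version needs one extra forward run per weight, while the backwards pass gets every weight's slope from a single trip.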


Step size: the one knob that ruins everything

Okay, you know which way is downhill. How big a step should you take?

This is probably the most-tuned setting in all of deep learning, and it has a very academic name: the learning rate. Ignore the name. It just means: how long is a step?

Try this thought experiment. You're on a hill, blindfolded, trying to reach the bottom.

  • Huge steps. You know which direction is down. You take a massive stride. But because your steps are so big, you fly right over the valley and land on the hill on the other side — possibly higher up than where you started. Then you take another huge step back, and overshoot again. You bounce around forever and never settle into the low point. Training is unstable and goes nowhere.

  • Microscopic steps. You shuffle an inch at a time. You never overshoot, but reaching any destination takes an eternity. Training is stable but painfully slow, and might time out before you get anywhere useful.

  • Goldilocks steps. Just the right size. Big enough to make real progress, small enough not to overshoot. You cruise down into a valley in a reasonable amount of time, and you settle there.
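The three walkers above can be simulated in a few lines of Python. This is a toy sketch that assumes the simplest possible landscape, a one-weight bowl loss(w) = w * w with its bottom at w = 0. The specific step sizes (1.1, 0.001, 0.3) are picked to suit this bowl and are not recommendations for real networks.

```python
def slope(w):
    return 2.0 * w  # slope of the toy bowl loss(w) = w * w

def walk(learning_rate, start=10.0, steps=20):
    """Take `steps` downhill steps of the given size and return where we end up."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w

for name, lr in [("huge", 1.1), ("microscopic", 0.001), ("goldilocks", 0.3)]:
    print(f"{name:12s} lr = {lr}  ends at w = {walk(lr):.4g}")
```

With these numbers the huge-step walker ends up much farther from the bottom than its starting point of 10, the microscopic walker has barely moved, and the Goldilocks walker is sitting almost exactly at 0.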

The learning rate is the step size, and getting it wrong is one of the most common ways a training run fails. Too big, and the loss jumps around erratically and never settles. Too small, and the loss crawls down so slowly that you give up. Modern techniques have gotten clever about this, typically starting with bigger steps early and shrinking them as the network approaches a valley, but the core trade-off never goes away.
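The "bigger steps early, smaller steps later" idea can be sketched on a toy bowl, loss(w) = w * w. The schedule used here, multiplying the step size by 0.9 after every step, is just one simple choice assumed for illustration; the names decayed_rate and walk and all the numbers are hypothetical.

```python
def slope(w):
    return 2.0 * w  # slope of the toy bowl loss(w) = w * w

def decayed_rate(step, initial_rate=0.95, decay=0.9):
    """Exponential decay, one simple schedule among many: shrink the step size every step."""
    return initial_rate * (decay ** step)

def walk(rate_for_step, start=10.0, steps=20):
    """Walk downhill, asking rate_for_step how long each step should be."""
    w = start
    for step in range(steps):
        w = w - rate_for_step(step) * slope(w)
    return w

fixed = walk(lambda step: 0.95)  # same big step every time: keeps bouncing across the valley
shrinking = walk(decayed_rate)   # big steps early, smaller steps later
print(f"fixed step size ends at     w = {fixed:.4g}")
print(f"shrinking step size ends at w = {shrinking:.4g}")
```

On this bowl the fixed big step keeps hopping from one side of the valley to the other and is still far from the bottom after 20 steps, while the shrinking schedule settles much closer to 0.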

If you ever hear a machine learning practitioner say "the training diverged," what they mean is "the steps were too big and the loss exploded." If you hear "the training plateaued," it usually means "the steps were too small or we hit a flat part of the landscape." The landscape metaphor keeps earning its keep.


The landscape is weird, and that matters

Now the uncomfortable honest part. That loss landscape is not a nice smooth bowl with one clean bottom. It's a high-dimensional, crumpled, mostly-empty space with an enormous number of local dips, ridges, plateaus, and weird shapes that don't exist in two-dimensional hills. Your intuition of "a hill with one bottom" is useful for getting started but wildly misleading about what the real thing is like.

Three honest observations about real loss landscapes:

Local minima are mostly fine. An old worry was that your walk would get stuck in some local dip that isn't the deepest, and you'd end up with a mediocre network. In practice, for large networks in very high dimensions, most of the local minima you actually reach turn out to be about as good as each other. The absolute best one is very hard to find, but you almost never need it. Any of the decent valleys will give you a working model. Deep learning basically lucked into this property; few people predicted it, and the theory behind it is still being worked out.

Saddle points slow you down. A saddle point is a place that slopes down in some directions and up in others — like a horse saddle. On the way to a valley, you'll pass through a lot of these, and the local slope gets small and confusing, so your blindfolded walker slows way down. A lot of training "plateaus" are actually saddle points, not dead ends. Keep walking; you'll usually find a way down.
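A saddle-point plateau is easy to reproduce on a made-up two-weight landscape. The sketch below assumes a surface with a saddle at (0, 0), where the altitude is 1, and two valleys at (0, 1) and (0, -1), where the altitude is 0. The starting point, the step size, and the function names are all chosen for illustration.

```python
def loss(x, y):
    """Made-up landscape: a saddle at (0, 0) and valleys at (0, 1) and (0, -1)."""
    return x ** 2 + (y ** 2 - 1.0) ** 2

def slope(x, y):
    """The gradient of the landscape above, worked out by hand."""
    return 2.0 * x, 4.0 * y * (y ** 2 - 1.0)

x, y = 2.0, 0.001  # start on the hillside, almost exactly on the ridge line
learning_rate = 0.05
losses = [loss(x, y)]  # losses[k] is the altitude after k steps
for step in range(80):
    dx, dy = slope(x, y)
    x, y = x - learning_rate * dx, y - learning_rate * dy
    losses.append(loss(x, y))

for step in range(0, 81, 10):
    print(f"step {step:2d}  loss = {losses[step]:.4f}")
```

The loss drops quickly at first, then hovers near 1 while the walker creeps across the saddle where the slope is tiny, and only later falls toward 0 once it finds the way down into a valley.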

The landscape is shockingly forgiving. Given how crumpled and weird these landscapes are in theory, the fact that downhill-walking works at all is a mild miracle. People are still actively trying to understand why. For now, the working summary is: deep networks don't train because we cleverly navigate the loss landscape. They train because the landscape, for reasons we don't fully understand, is kind enough to reward plain-vanilla downhill walking from almost anywhere you start.


What you're actually paying for when you "train a model"

Here's the whole arc of a training run, in one breath. You start with a random set of weights and therefore a random point on the landscape, usually high up. You run a batch of training examples through the network forward and get a loss. You run backprop to get the gradient — which way is downhill for every weight. You take a tiny step in that direction. You do it again with the next batch. And again. And again.
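Here is that whole arc as a runnable Python sketch. It assumes the smallest model that still has a loop worth watching: two weights, w and b, predicting w * x + b, trained on made-up data that secretly follows y = 3 * x + 1. With a model this small, the backwards pass collapses into two one-line formulas; the batch size, step size, and step count are arbitrary illustrative choices.

```python
import random

random.seed(0)  # fixed seed so the walk is repeatable

# Made-up training set: the hidden rule is y = 3 * x + 1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]

def mean_loss(w, b, examples):
    """Altitude: squared error averaged over the given examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

# 1. Start at a random point on the landscape.
w, b = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
learning_rate = 0.1
batch_size = 8
start_loss = mean_loss(w, b, data)

for step in range(500):
    # 2. Run a fresh batch forward and measure how wrong each guess is.
    batch = random.sample(data, batch_size)
    errors = [(w * x + b - y, x) for x, y in batch]
    batch_loss = sum(e * e for e, _ in errors) / batch_size
    # 3. Get the gradient: which way is uphill for each weight.
    grad_w = sum(2.0 * e * x for e, x in errors) / batch_size
    grad_b = sum(2.0 * e for e, _ in errors) / batch_size
    # 4. Take a small step the opposite way, downhill.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

end_loss = mean_loss(w, b, data)
print(f"loss went from {start_loss:.3f} to {end_loss:.6f}; w = {w:.3f}, b = {b:.3f}")
```

After 500 small steps the weights land very close to w = 3 and b = 1, even though the loop never saw the hidden rule, only the slope under its feet on each batch.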

Modern training runs do this millions or billions of times. Every step is cheap on its own — some multiplications, some additions, some squishing. The cost is that there are so many steps. When someone tells you a frontier model "cost forty million dollars to train," they mean: we walked downhill on a loss landscape for three months across tens of thousands of GPUs, each step computed a fresh gradient on a fresh batch of data, and at the end the weights were in a low-loss valley we liked.

That's it. That's the expensive part of AI. Not the model architecture. Not the code. The walk.

It's also why data and compute are the main levers of modern AI. More data means a better-shaped landscape (the grader is giving you truer feedback about what "downhill" really means). More compute means more steps in the same amount of time, which means you can afford to walk your way out of saddles, try more starting points, and reach lower valleys. Everything else (cleverer architectures, better learning-rate schedules, fancy optimizers) matters at the margin, but the two levers you can actually spend money on are "how much data" and "how many steps."


What just changed in your head

You started this post thinking of "training a neural network" as some obscure engineering ritual performed by people with PhDs. You're ending it with a picture of a blindfolded walk down a crumpled hillside — one tiny downhill step at a time, a billion times in a row, until the walker is low enough to stop.

Two sentences worth walking away with:

Training is a walk downhill on a landscape of possible weight settings, where altitude equals loss. The walker can't see — it can only feel the slope under its feet, one step at a time.

Everything in training — the gradient, backprop, the learning rate, the cost of compute — is just machinery for answering "which way is down from here?" and "how big a step should I take?"

Hold onto those, because they fill in the last big missing piece of how neural networks work. With the neuron in your head, the ladder from the last post, and the downhill walk from this one, you have the whole recipe. Everything we see from here is a specific shape of network tuned for a specific kind of data.

In the next post, we meet the first famous one: the convolutional neural network, or CNN. It's the architecture that cracked computer vision open in 2012, and the intuition is gorgeous — a little flashlight that slides across an image looking for the same kinds of small patterns everywhere it goes. No calculus. No code. One flashlight.


Course navigation

⬅️ Previous: M3.2 · Why Stacking Layers Works
📍 You are here: M3.3
Next ➡️: M3.4 · CNNs — With a Flashlight

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
