Reinforcement Learning — The Dog-Treat Model of Learning

A puppy sits in the middle of the kitchen, staring at you. You say "sit." The puppy blinks. You say it again, slower, with more conviction. The puppy scratches an ear. You say it a third time. The puppy, for reasons that will remain mysterious, finally lowers its back end to the floor — and you produce a treat as if you'd summoned it from another dimension.

Five minutes later, you say "sit" and the puppy sits. Not perfectly. Not every time. But often enough that you're hooked.

Here's the strange part. You never explained what "sit" means. You didn't show the puppy a picture of itself sitting. You didn't write down a rule. There was no classifier, no embedding, no label. There was just this: something happened, and then a treat appeared, and now it happens more often.

That is reinforcement learning. No math. No code. One puppy.

So far in Module 2 we've met supervised learning (labels as the teacher) and unsupervised learning (no labels, just structure). This third flavour — reinforcement learning, or RL — is the weirdest of the three and, in some ways, the most interesting. It's the mental model behind every game-playing AI of the last decade, every robot that learns to walk, and — increasingly — the final training step of every chatbot you've ever talked to.

The setup: not "what is this?", but "what should I do?"

Supervised and unsupervised both ask variations of "what is this?" — is this email spam, what group does this customer belong to, which word comes next. Reinforcement learning asks a different question:

What should I do?

That's the shift. The learner isn't classifying things. It's choosing actions. And the only feedback it ever gets is whether the action, eventually, made things better or worse.

Here's the whole shape in one loop: the agent acts, the environment answers with a new situation and a reward, and round it goes again.

Four words do all the work. The agent is whatever is learning — a puppy, a game-playing neural network, a robot arm, a delivery-route optimizer. The environment is whatever the agent is acting inside — the kitchen, a chess board, a warehouse floor. At every moment, the agent picks an action (sit, move pawn, turn left), and the environment responds with a new situation and a reward — a number, tiny or huge, positive or negative, saying "that was good" or "that was bad."

The agent's only job is to figure out which actions, taken in which situations, lead to the most reward over time. That's it. No labels. No structure-finding. Just trial, error, and a slowly sharpening sense of what pays off.
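If you'd like to see that loop as code, here is a minimal sketch, not a real training setup. Everything in it is made up for illustration: the `Kitchen` environment, the `Puppy` agent, the three actions, and the rule that a rewarded action simply gets a bigger weight the next time the puppy picks at random.

```python
import random

ACTIONS = ["sit", "scratch_ear", "stare"]


class Kitchen:
    """Hypothetical environment: a treat (reward 1.0) appears only after 'sit'."""

    def step(self, action):
        reward = 1.0 if action == "sit" else 0.0
        situation = "owner_says_sit"  # this toy kitchen never changes
        return situation, reward


class Puppy:
    """Hypothetical agent: rewarded actions become more likely to be picked."""

    def __init__(self, rng):
        self.rng = rng
        self.preference = {a: 1.0 for a in ACTIONS}

    def act(self, situation):
        # The situation is ignored because this toy world only has one.
        weights = [self.preference[a] for a in ACTIONS]
        return self.rng.choices(ACTIONS, weights=weights)[0]

    def learn(self, action, reward):
        # "Something happened, a treat appeared, now it happens more often."
        self.preference[action] += reward


def run_loop(steps=200, seed=0):
    rng = random.Random(seed)
    env, agent = Kitchen(), Puppy(rng)
    situation = "owner_says_sit"
    total_reward = 0.0
    for _ in range(steps):
        action = agent.act(situation)            # agent picks an action
        situation, reward = env.step(action)     # environment responds
        agent.learn(action, reward)              # agent adjusts
        total_reward += reward
    return agent.preference, total_reward


if __name__ == "__main__":
    preference, total_reward = run_loop()
    print(preference, total_reward)
```

Nobody ever tells the puppy what "sit" means. The only thing that changes is the weight on whichever action happened to pay off.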

Here's the honest definition to hold onto:

Reinforcement learning is the flavour where an agent learns what to do by trying actions in an environment and getting rewards back.

Simple sentence. Harder problem than it looks. Let's see why.

What makes RL weirdly hard

At first glance RL sounds easier than supervised learning — at least you don't need a pile of labelled examples. But two things make it much trickier in practice, and naming them is half the battle.

The reward is delayed, and you don't know what caused it. When the puppy finally sits and gets a treat, it doesn't know whether the treat was for the sitting, the ear-scratching, the staring, or the fourth thing it did half a minute earlier. The agent has to somehow figure out which of its past actions deserved the credit — a problem that has an actual name: credit assignment. In chess, you win or lose after forty moves; which of those moves actually mattered? In a game, you die in a pit a minute after picking up the wrong sword; was it the sword that killed you, or the left turn three rooms back? Most of RL is, at heart, figuring out how to spread a single reward signal backwards across a long chain of decisions, fairly.
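One common way to spread a late reward backwards is to discount it: each earlier step gets a share of the final reward, shrunk by a factor for every step of distance. The sketch below assumes that approach. The function name and the discount factor `gamma` are illustrative choices, and real algorithms layer a lot more on top.

```python
def discounted_returns(rewards, gamma=0.9):
    """Credit each step with its own reward plus a discounted share of what came later."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so every step can see the rewards that followed it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three unrewarded actions, then a treat on the fourth.
    print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
    # -> [0.125, 0.25, 0.5, 1.0]
```

The action right before the treat gets the most credit, and the ear-scratch from three steps earlier gets only a sliver. That is a guess about what mattered, not a proof, which is why credit assignment stays hard.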

You have to try things to learn, and trying things is expensive. If the puppy never takes a random action, it will never accidentally sit, and will never get the treat, and will never learn. But if it spends all day taking random actions, it's also not going to improve. This tension — exploration vs exploitation — is the central dance of RL. The agent has to try new things to discover better strategies, and lean on what it already knows to actually collect rewards. Too much exploring, you never cash in. Too much exploiting, you never grow. Every RL algorithm has some version of a knob that balances the two, and tuning that knob well is often where the wins come from.
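The simplest version of that knob is usually called epsilon-greedy: with a small probability `epsilon` the agent tries a random action, and otherwise it takes whatever currently looks best. The sketch below assumes a made-up three-lever slot machine, with invented payout probabilities, to show the knob in action.

```python
import random


def run_bandit(payout_probs, epsilon, steps=5000, seed=0):
    """Epsilon-greedy on a hypothetical slot machine with one lever per entry."""
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n        # how often each lever was pulled
    estimates = [0.0] * n   # running average reward per lever
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: try something at random
        else:
            # exploit: pick the best-looking lever (ties go to the lowest index)
            arm = max(range(n), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.8]  # invented payout rates; the last lever is best
    for eps in (0.0, 0.1, 1.0):
        counts, _, total = run_bandit(probs, epsilon=eps)
        print(f"epsilon={eps}: pulls per lever={counts}, total reward={total}")
```

With `epsilon=0.0` the agent never leaves the first lever, because it never tries anything else. With `epsilon=1.0` it explores forever and never cashes in on what it has learned. The useful settings live somewhere in between.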

Both of these problems are solvable, and the last decade has been a long, slow story of solving them better. But they're the reason RL, for all its elegance, is the most fragile of the three flavours we're meeting. It takes more data, more compute, and more babysitting than supervised learning usually does. When it works, though, it works in ways the other flavours just can't.

Why games were the first big win

If you followed AI news over the last ten years, you probably noticed that every few months, a new headline would arrive about an AI beating humans at some game: Atari, Go, StarCraft, Dota, chess again but this time learning from scratch. There was a reason game after game fell first. Games are the perfect environment for RL: they are cheap to run, fast to finish, safe to fail in, and they hand back a clean score at the end.

Every one of those four properties is exactly what RL needs and every one of them is hard to come by in the real world. A self-driving car can't play a million random games of "drive into a tree and see what happens." A robot surgeon can't explore its action space on real patients. Reality is slow, dangerous, and the reward signal for most real problems is noisy and far away.

Games have none of those problems. You can spin up a million copies of a game in parallel, play each to completion in seconds, and get a clean score every time. The agent can die a billion times without anyone minding. It can start from zero, play against itself, and bootstrap its way to superhuman — which is exactly what AlphaZero famously did with chess and Go: no human games in its diet at all, just self-play and a reward signal of "did you win."

The headline was "AI beat a grandmaster." The mechanism was "a reinforcement-learning agent played itself tens of millions of times in a giant training run and got really good at the game it found itself inside."

Hold onto this: RL shines brightest where the environment is cheap, fast, and safely repeatable. That's why games fell first, why robotics is slowly falling next (mostly via simulation), and why a lot of the real-world frontier is about making good enough simulators of the real thing so you can train in them before letting the agent touch reality.

The quiet RL inside every chatbot you use

Here's the part most people don't know. Every big chatbot you've used — ChatGPT, Claude, Gemini — has reinforcement learning baked into it, in a specific and clever way.

After the chatbot has been trained on a huge pile of internet text (that's the unsupervised part from the last post), it can produce fluent, plausible-sounding output. But plausible isn't the same as helpful, honest, or safe. Left alone, a raw language model is perfectly happy to make things up, to be condescending, to refuse easy questions, or to cheerfully explain how to do things nobody should explain. It's fluent in the same way a confident sophomore is fluent: very, and unreliably.

So the last training step, for every modern chatbot, is a flavour of RL. Human raters are shown pairs of answers to the same question and asked which one is better. Those preferences become a reward signal, and the model is nudged to produce more answers like the preferred ones and fewer like the rejected ones. Do this enough times and the rough, plausible-but-unruly model turns into a polished, helpful-by-default assistant. This whole process is called RLHF — reinforcement learning from human feedback — and it is one of the most consequential ideas in modern AI even though almost nobody outside the field has heard of it.
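How do "which one is better?" votes become a number? One widely used recipe, assumed here because this post doesn't say which method any particular chatbot uses, is a Bradley-Terry style model: give every answer a score, and push the preferred answer's score above the rejected one's. The sketch below covers only that scoring step, with invented answer names and hyperparameters. Nudging the chatbot itself towards high-scoring answers is a separate and much bigger step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rewards(answers, preferences, lr=0.5, epochs=200):
    """Learn one score per answer from (chosen, rejected) pairs."""
    reward = {a: 0.0 for a in answers}
    for _ in range(epochs):
        for chosen, rejected in preferences:
            # Modelled probability that the rater prefers `chosen`.
            p = sigmoid(reward[chosen] - reward[rejected])
            # Gradient ascent on log(p): widen the gap when p is still low.
            step = lr * (1.0 - p)
            reward[chosen] += step
            reward[rejected] -= step
    return reward


if __name__ == "__main__":
    answers = ["clear", "rambling", "made_up"]  # hypothetical answers to one question
    votes = [("clear", "rambling"), ("clear", "made_up"), ("rambling", "made_up")]
    print(fit_rewards(answers, votes))
```

Real systems train a neural network to score answers it has never seen, instead of keeping a lookup table. The idea is the same: pairwise preferences in, reward signal out.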

You can feel RLHF at work every time a chatbot politely refuses to do something, softens a blunt answer, adds a caveat, or chooses the clearer of two equally correct explanations. Those aren't hand-coded rules. They're the shape left behind by a reward signal trained on human preferences. The puppy in the kitchen, at vastly larger scale.

What's left to figure out

RL is the flavour that has travelled furthest from "learning" in the everyday sense. The puppy, the gamer, the robot — all three are figuring out what to do in a world that pushes back. And everything about that setting is harder than classifying pictures: rewards are sparse, simulators are imperfect, long-horizon planning is still clunky, and agents that look smart in one environment often fall apart in a slightly different one.

But RL is also where the most open-ended, genuinely surprising behaviour has shown up. Supervised models, at their best, are very good at mimicking the patterns in their training data. Unsupervised models are very good at finding structure in it. RL agents, when everything clicks, sometimes invent strategies nobody showed them — the famous Go moves that human masters had never considered, the game glitches an Atari agent discovered and exploited, the robot gaits that don't look anything like how biologists thought creatures should move. Given a reward signal and enough time, an RL agent is free to be weirder than its training data.

That's the part worth keeping an eye on. The flavour that feels least like teaching is also the one where the machine is most likely to surprise you with what it comes up with.

What just changed in your head

You started this post with RL as that one AI term that sounded like it was always about robots and Atari. You're ending it seeing it in a puppy, in every game-playing headline of the last decade, and — quietly — in the polish on every chatbot you've ever typed into. One mechanism, three very different costumes.

Here's the sentence worth walking away with: in supervised learning you learn to answer; in unsupervised learning you learn to see; in reinforcement learning you learn to act. Three flavours, one shared move — fit a shape to examples — but with examples that look wildly different in each case. Supervised: labelled dots. Unsupervised: unlabelled dots. RL: a whole sequence of actions followed, eventually, by a treat or a slap.

That's the three-way map of machine learning. Everything else in the field — the architectures, the algorithms, the frameworks — is how you fit the shape. The what we're fitting, and the what we're fitting it from, is one of these three.

In the next post, we finally meet the one failure mode that can ruin any of them. It's called overfitting, and it's the reason every experienced ML practitioner develops the same twitchy paranoia about their own results. Once you see it, you'll start noticing it everywhere — including in a lot of AI headlines that claim victories that didn't quite happen.


📚 AI Zero to Hero · Course Home — all 33 posts, six modules.

Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
