Scaling Laws and the Bitter Lesson

In 2019, a researcher named Richard Sutton — one of the people most responsible for reinforcement learning becoming a thing — wrote a short essay called "The Bitter Lesson." It's about 1,500 words. It's probably the single most important essay about AI research in the last twenty years. And its thesis is a punch in the stomach for anyone who has spent their career trying to be clever about AI.

The thesis is this: whenever researchers have had a choice between building their own knowledge of a problem into an AI system and letting the system figure it out through brute computation, the brute-computation approach has eventually won. Not just won: won by a landslide, and made the clever approach look like a historical curiosity.

Sutton traces this pattern across chess, Go, speech recognition, and computer vision. Every single time, the story was the same: early researchers poured years into carefully hand-designed features and domain knowledge; later researchers threw that out and used big, dumb, general-purpose learning on lots of data; the later approach eventually outperformed the earlier one so badly that the early work stopped being cited. Every single time. Sutton wrote his essay before LLMs had their ChatGPT moment. Three years later, that moment confirmed his thesis so hard it's almost embarrassing.

In this post, the last one of Module 4, we're going to walk through the bitter lesson, meet a closely related idea called scaling laws, and think about what both of them mean for the near future of AI. This is the post that pulls a lot of earlier threads together. No calculus. One uncomfortable idea.

The bitter lesson, in one example

Here's the shape of every cycle Sutton described. Let me walk through one concrete version: computer chess.

For decades, researchers tried to encode human chess knowledge into chess programs. Opening books. Positional heuristics. Hand-crafted evaluation functions that said things like "a knight in the centre is worth 0.3 pawns more than a knight on the edge." Entire books were written about how to encode grandmaster intuition into code. Teams spent years on it.

Then, in 1997, IBM's Deep Blue beat Garry Kasparov. Deep Blue had some of the clever heuristics, yes — but most of its strength came from one thing: it could evaluate 200 million positions per second. It was simply computing more than any human could. The hand-crafted chess knowledge helped, but the brute force helped more.

Twenty years after that, in 2017, AlphaZero came along and beat Stockfish, one of the strongest traditional chess engines. Beyond the rules of the game, AlphaZero used no hand-crafted chess knowledge at all. It learned entirely from self-play, using a generic reinforcement learning algorithm. No opening books. No grandmaster heuristics. Just neural networks, self-play, and enormous compute. It destroyed the previous state of the art.

The researchers who had spent careers encoding chess knowledge weren't wrong. Their work had been state of the art for decades. They were just on the losing side of the bitter lesson. Every year, compute got cheaper. Every year, data got more abundant. Eventually, the simple-plus-big approach crossed a threshold where it could do what the clever approach did, and then it kept going, and the clever approach got left behind.

The same pattern happened in speech recognition (hand-crafted acoustic models gave way to neural networks trained on huge corpora), in machine translation (hand-crafted rules gave way to statistical methods which gave way to neural methods which gave way to transformers), and in computer vision (hand-crafted feature detectors gave way to CNNs trained on ImageNet, as we saw in M3.4). Every time, the clever crowd had their decade. Every time, the scale crowd eventually took over.

Sutton's bitter lesson is the observation that this has happened so often, across so many subfields, that it's probably not a coincidence. There's something about the nature of the problem that rewards scale over cleverness. Researchers hate it because their cleverness was the thing they loved doing. But the results don't care.

Scaling laws: the bitter lesson gets quantified

For a long time, "bigger works better" was folklore. Then, starting around 2017 and accelerating through 2020, researchers began to measure how much better, and they found something unnerving: the improvements weren't random. They were predictable.

The central finding was a set of equations called scaling laws. The headline result is this: as you increase the amount of compute, the amount of data, and the number of parameters used to train a language model, the loss on held-out data drops along a smooth, predictable curve. On ordinary axes it looks like a gently bending curve; on logarithmic axes it is close to a straight line, which is the signature of a power law. Each doubling buys a little less than the last, but the curve keeps heading in the "better" direction across many orders of magnitude, long past the point where you'd expect it to flatten out.

Put differently: if you give me the size of the model, the amount of data it was trained on, and the amount of compute used, I can predict, before training starts, roughly how good the model will be at predicting the next word. Not exactly. But within a range that's way tighter than anyone expected before scaling laws were identified.
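
If it helps to see the shape, here is a minimal sketch of what such a prediction can look like. It assumes a commonly used functional form: an irreducible floor, plus one power-law term for model size and one for data. The function name and every constant below are illustrative assumptions, not values from this post or from any particular lab's fit.

```python
# Toy scaling-law predictor. All constants are illustrative assumptions.

def predicted_loss(n_params, n_tokens,
                   floor=1.7,              # loss that no amount of scale removes (assumed)
                   a=400.0, alpha=0.34,    # model-size term (assumed)
                   b=400.0, beta=0.28):    # data term (assumed)
    """Predict held-out loss from parameter count and training tokens."""
    return floor + a / n_params ** alpha + b / n_tokens ** beta


if __name__ == "__main__":
    # Scale model and data up together, ten times at each step.
    for n_params in (1e8, 1e9, 1e10, 1e11):
        n_tokens = 20 * n_params
        loss = predicted_loss(n_params, n_tokens)
        print(f"{n_params:.0e} params, {n_tokens:.0e} tokens -> loss {loss:.3f}")
```

Run it and the loss falls at every step, by a smaller amount each time, and never drops below the floor. That is the "smooth, predictable curve" in miniature.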

This was a shock. It meant that researchers didn't need to be clever to make progress anymore. They just needed to be patient and rich. You could literally draw a chart, identify where you wanted to be, calculate the budget to get there, and go. Every major lab started doing this. As far as public reports show, frontier models since GPT-3 have been sized with scaling laws as a planning tool.

There's a famous paper from 2022, known as the Chinchilla paper, that refined the original scaling laws. It showed that the large models of the day were too big for the data they were trained on: for a given compute budget, you get a better result from a smaller model trained on more data than from a bigger model trained on less. This one finding immediately changed how the industry allocated its training runs, and "Chinchilla-optimal" sizing (picking the model size and data amount that scaling laws predict will use the compute best) became the standard reference point for modern frontier models.
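
Here is a rough sketch of that sizing calculation. It leans on two widely quoted rules of thumb, both of which are assumptions rather than anything stated in this post: training compute is roughly 6 FLOPs per parameter per token, and the compute-optimal ratio is roughly 20 training tokens per parameter. Real labs fit their own numbers; the function name is made up for illustration.

```python
import math

# Rule-of-thumb constants (assumptions, not exact values).
FLOPS_PER_PARAM_PER_TOKEN = 6.0   # compute ~= 6 * params * tokens
TOKENS_PER_PARAM = 20.0           # compute-optimal data-to-model ratio


def compute_optimal_size(compute_flops):
    """Split a compute budget into (parameters, training tokens).

    Solves compute = 6 * N * D with D = 20 * N, which gives
    N = sqrt(compute / 120).
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_size(budget)
        print(f"budget {budget:.0e} FLOPs -> {n:.2e} params, {d:.2e} tokens")
```

Notice that a budget 100 times larger only makes the model 10 times bigger under these assumptions. The other factor of 10 goes into data, which is exactly why the data supply becomes a problem later in this post.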

Think about what this means. You have a research field where progress is so predictable that you can buy it with a credit card. Nobody is surprised by how good the next model is; they can estimate it in advance. The challenge is not "can we build something smarter?" — it's "can we afford a bigger training run, and can we find enough data to feed it?"

That is a wild place for an entire scientific field to be in, and it is the bitter lesson weaponized.

Why does this keep happening?

It's worth pausing on the deep question: why does scale beat cleverness? Why is the world arranged such that a big dumb model trained on everything outperforms a small smart model trained on the right things?

Sutton's own answer is a mix of two observations.

Observation 1 · general methods scale. Clever methods don't. When you hand-code knowledge into a system, the benefit is limited by how much knowledge you can encode. A human team can only write so many heuristics. A general learning method, on the other hand, gets better as you give it more data and more compute — and those are things that grow exponentially cheaper every year. So over time, the general method's curve crosses the clever method's plateau, and then it keeps going while the clever method stays stuck.

Observation 2 · the clever insights we encode are almost always wrong or incomplete. Human intuitions about how to solve a problem are themselves approximations. When we hard-code "a knight in the centre is worth 0.3 pawns," we're encoding a shadow of the real answer. A learning system, with enough data, eventually finds the real answer — which is messier, more contextual, and better than the shadow. Our cleverness is, in hindsight, usually revealed to be a first draft that the learner improves on.

Those two observations together explain why the pattern is so unreasonably consistent. It's not that cleverness is bad. It's that cleverness grows only as fast as the human effort behind it, while general methods keep improving as compute grows, and compute has been getting exponentially cheaper for seventy years. Bet against compute at your peril.
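
The crossing-curves argument is easy to play with in a toy model. The sketch below assumes a clever method stuck at a fixed plateau and a general method that gains a fixed number of points every time compute doubles, with compute doubling on a fixed schedule. Every number and name here is hypothetical and chosen only to show the shape of the argument, not to describe any real system.

```python
def crossover_year(plateau, start_score, gain_per_doubling,
                   doubling_years, horizon_years=100):
    """Return the first whole year the general method beats the plateau.

    plateau:            score where the hand-crafted method is stuck
    start_score:        general method's score in year 0
    gain_per_doubling:  points gained each time compute doubles
    doubling_years:     years it takes compute to double
    Returns None if no crossover happens within the horizon.
    """
    for year in range(horizon_years + 1):
        doublings = year / doubling_years
        general_score = start_score + gain_per_doubling * doublings
        if general_score > plateau:
            return year
    return None


if __name__ == "__main__":
    # Hypothetical numbers: the clever method starts far ahead (80 vs 40).
    year = crossover_year(plateau=80, start_score=40,
                          gain_per_doubling=5, doubling_years=2)
    print("general method overtakes in year", year)
```

Change the numbers however you like. As long as the plateau is finite and compute keeps doubling, the only thing that moves is the date of the crossover, not whether it happens.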

A lot of AI research careers have been spent betting on cleverness and losing. A lot of others have been spent betting on scale and winning. If you've ever wondered why the people running the big labs seem so confident about continuing to spend billions of dollars — it's because the scaling curves have kept holding. Every generation, the same shape of chart. Every generation, cheaper compute and more data push the frontier further. Every generation, clever alternatives either get absorbed into the general method or get left behind.

The caveats nobody wants to put in the press releases

Scaling isn't infinite, and the bitter lesson isn't universal. Three caveats are worth knowing about, because they're where the current research frontier actually sits.

Caveat 1 · data is running out. Scaling laws tell you how much data you need to match a given model size. But the amount of high-quality text humans have written is finite. Current frontier models have already been trained on most of the publicly accessible internet. Running another generation at the current data-to-model ratio would require roughly 10 times more text than exists. Labs are hitting this wall and are scrambling for solutions: synthetic data generation, data from other modalities (images, video), harder-to-access sources (private datasets, books, code archives), and better use of the data they have. Nobody knows whether the scaling curves will hold once "just get more internet" stops being an option.

Caveat 2 · compute is running out too. Not at the same rate as data, but close. The biggest training runs today use entire data centers. Going 10 times bigger is a multi-billion-dollar commitment before you've even trained the thing. The number of organizations that can afford to train a frontier model from scratch is, at this point, very small. We may be reaching a point where each generation of scale-up is done by one or two players, and the gap with everyone else widens.

Caveat 3 · scaling laws describe pretraining, not the things you actually want. Scaling laws predict how well the model will do at predicting the next word. That's correlated with how useful the model is, but it's not the same thing. For some capabilities — honesty, reasoning on unfamiliar problems, knowing when it's wrong — scale alone isn't obviously doing the trick. This is where researchers are investing a lot right now: can we find other levers besides size, data, and compute that move the needle on capability? Early results are mixed. Some techniques — better training data, better post-training, better prompting — do meaningfully help. Others don't. Nobody is sure yet how much capability is gated by scale alone versus scale plus technique.

The honest summary of where we are: scaling has been the main thing for 5-10 years; it will probably still be an important thing for the next few years; but the bets people are placing now are about what comes next, because scaling alone can't carry the field forever.

What this means for you

If you take one practical thing out of this whole post, take this: be skeptical of anyone who tells you they've found a clever shortcut that makes a smaller model as good as a bigger one. Sometimes it's true. Usually it's not. The bitter lesson keeps being bitter because people keep pitching the same alternatives and the alternatives keep quietly losing over 18 months.

And be skeptical, equally, of anyone who tells you that the current scaling curves will keep going forever with no ceiling. We're hitting walls on data, compute, and power supply all at once. Something has to give. The next few years of AI will be shaped as much by how we get past those walls — synthetic data, new architectures that scale differently, specialised models instead of one giant general one — as by raw model size.

Module 4 ends with this honest tension. The bitter lesson has been ruthlessly correct for thirty years. The scaling laws have been shockingly regular for five. And yet the specific regime we've been in might be ending. "What comes after scaling?" is the question every serious researcher is working on right now. Nobody has a clean answer yet. We'll see together.

What just changed in your head

You started this post thinking of AI progress as a bunch of disconnected breakthroughs. You're ending it with a single structural observation: general learning on lots of data and compute, applied without trying to be too clever, has won every direct comparison it's ever been in. Not because cleverness is bad, but because cleverness can't grow as fast as compute can.

One sentence worth carrying forward:

Every time in AI history you could have bet on clever algorithms versus bigger computers with simpler algorithms, the bigger computers have eventually won. That fact is the scaffolding of modern AI.

Hold onto that. It's the thing that explains why labs spend billions on training runs, why architectures converge (everyone is using some variant of the transformer now), and why the next five years will probably look like more of the same, scaled up, until the data or the compute runs out.

That's Module 4. Neural networks, stacked into transformers, scaled up by scaling laws, polished by a three-act training story, and powered by the bitter lesson. Everything you use today — ChatGPT, Claude, Gemini, Copilot, and a hundred other tools — sits on top of that stack.

In Module 5, we zoom out from how LLMs work to how to use them well. Prompting as a skill. Retrieval-augmented generation. The decision framework for when to fine-tune vs prompt vs retrieve. How to evaluate whether an LLM is any good, which turns out to be shockingly hard. And an honest look at where LLMs still go wrong: hallucination, sycophancy, bias, jailbreaks. If Module 4 was about the machinery, Module 5 is about living with what it produces.

See you there.

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.

Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
