Features, Labels, and the Data Diet

A great chef with bad ingredients still serves a bad meal. Machine learning has its version of that law — and once you see it, you understand why 'garbage in, garbage out' isn't a slogan, it's the whole story.


Imagine the best chef you've ever heard of. World-class. Decades of training. Instincts you can't teach. Now imagine you send her to a kitchen where the ingredients are wrong — the butter's rancid, the tomatoes are out of season, half the spice rack is mislabelled, and the chicken is… probably fine. Ask her to make dinner.

She can't save it. Not really. She can make the best possible meal given those ingredients, and it will still be mediocre, because there is no amount of technique that rescues bad inputs. Every experienced cook has learned some version of this the hard way: the dish is downstream of the pantry.

Machine learning has its own version of this law, and it's the quiet punchline of Module 2. We've spent five posts talking about how models fit shapes to examples, how labels teach, how structure emerges, how rewards sharpen, how overfitting lies to you. All of that assumed something we haven't properly examined: the examples themselves. What's actually on the plate.

This post is about the plate. No math. No code. One pantry.


Every example is two things: features and a label

Zoom in on a single example in a training set. A photo of a cat with the label "cat." A row in a spreadsheet about a customer: age 34, city Berlin, plan "pro", churned: no. An email with the label "not spam."

Every example in supervised learning splits into two parts.

Features are the things you chose to measure about this example — the ingredients you put in the bowl. For a photo, the features are the raw pixel values. For a customer, the features are whatever columns you pulled from the database: age, plan tier, location, tenure, last login date. For an email, the features are the words and headers and maybe a handful of derived numbers like "does the sender address match the display name." Features are the description you're offering the model.

The label is the thing you want the model to predict — the correct output for that example. Cat or not cat. Churned or not churned. Next month's revenue. Tomorrow's temperature. The label is what turns a pile of raw data into a teaching moment.

And here's the quiet truth: the features you pick, and the quality of the labels you attach, together determine almost everything about what the model can possibly do. The architecture matters. The training algorithm matters. But they matter after the data matters, and by a smaller margin than most people think.
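The argument here doesn't need code, but if a concrete picture helps, this is a minimal Python sketch of the features/label split. Only the Berlin customer comes from the example above; the second row and the helper name `split_example` are made up for illustration.

```python
# Hypothetical customer rows. The first mirrors the example in the text;
# the second is invented purely for illustration.
rows = [
    {"age": 34, "city": "Berlin", "plan": "pro", "churned": "no"},
    {"age": 51, "city": "Lyon", "plan": "free", "churned": "yes"},
]

LABEL_COLUMN = "churned"  # the thing we want the model to predict


def split_example(row, label_column=LABEL_COLUMN):
    """Return (features, label) for one example without mutating the row."""
    features = {name: value for name, value in row.items() if name != label_column}
    label = row[label_column]
    return features, label


examples = [split_example(row) for row in rows]
features, label = examples[0]
print(features)  # the description offered to the model
print(label)     # the answer it is supposed to learn
```

Everything left in `features` is a choice somebody made. Drop a column here and the model never gets to see it.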


Garbage in, garbage out — literally

"Garbage in, garbage out" is one of those phrases that sounds like a folksy slogan until you realise it's a precise claim about what ML is actually doing.

Recall from M2.1 that a model learns by fitting a shape to its training examples. That sentence hides a brutal implication: the model can only fit the shape of the data you actually gave it. If your data is missing a feature that matters, the model cannot see it. If your labels are wrong, the model will happily learn to produce wrong labels. If your examples are drawn from a narrow slice of the world, the model will only be good within that slice. The grader — the loss — rewards matching the training data. Nothing else.

This means every mistake, distortion, or omission in the data gets baked into the model. Not metaphorically. Literally. Here are the everyday flavours of this, each responsible for a lot of ML pain:

Missing features. You tried to predict customer churn but forgot to include whether they'd contacted support in the last week — the single best predictor in your data. The model does its best with what's left and shrugs. No architecture trick rescues this. You had to put the tomato in the pot.

Wrong labels. Your spam dataset was labelled by interns who disagreed with each other about what "spam" meant. The model learns to match the interns' judgement, not "spam," and behaves weirdly in production because the interns were never entirely consistent. Wrong labels don't just add noise; they teach the model the wrong shape.

Biased sampling. Your face-recognition system works great on faces that look like the faces in your training set and dramatically worse on faces that don't. The model didn't fail; the dataset did. It only ever learned the shape of the faces you showed it, and those faces weren't representative of the world it was eventually deployed into. This is the failure mode behind many of the most embarrassing AI incidents of the last decade.

Stale data. You trained a product recommender on last year's purchases, then deployed it today when tastes have shifted. The shape is fine — for the world it was fit to. The world has just moved.

Leaked features. You included a feature that secretly encodes the answer. A classic: predicting whether a customer will churn, using a feature that was itself derived after knowing whether the customer churned. The model looks spectacular in training and offline testing (because it's basically cheating) and collapses in production (because the cheat isn't available there). This is the saddest one, because it looks like success right up until it doesn't.

Every one of these is a kind of garbage going in. And every one produces, with unnerving reliability, a kind of garbage coming out — not always the same kind as the input, but always some kind.
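For readers who like to see a failure happen, here is a toy Python sketch of the leaked-feature case. Everything in it is an assumption made for illustration: the column `exit_survey_sent` is a hypothetical field that only gets filled in after a customer has churned, and the "model" is a hand-written rule rather than anything trained.

```python
# Hypothetical rows. "exit_survey_sent" is recorded only AFTER a customer
# churns, so it secretly encodes the label: a leaked feature.
train = [
    {"tenure_months": 2,  "exit_survey_sent": 1, "churned": 1},
    {"tenure_months": 30, "exit_survey_sent": 0, "churned": 0},
    {"tenure_months": 4,  "exit_survey_sent": 1, "churned": 1},
    {"tenure_months": 3,  "exit_survey_sent": 0, "churned": 0},
    {"tenure_months": 26, "exit_survey_sent": 1, "churned": 1},
    {"tenure_months": 18, "exit_survey_sent": 0, "churned": 0},
]


def predict(row):
    """A deliberately naive stand-in for a model: trust the leaked column."""
    return row["exit_survey_sent"]


def accuracy(rows):
    """Fraction of rows where the prediction matches the true label."""
    correct = sum(predict(row) == row["churned"] for row in rows)
    return correct / len(rows)


# At prediction time nobody has churned yet, so the column is always 0.
production = [{**row, "exit_survey_sent": 0} for row in train]

train_accuracy = accuracy(train)
production_accuracy = accuracy(production)
print(train_accuracy)       # 1.0: looks perfect, because it is cheating
print(production_accuracy)  # 0.5: no better than guessing on this toy data
```

The customers are the same in both runs. The only thing that changed is whether the cheat was available.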


Why "more data" isn't the magic word

If you hang around ML long enough, you hear people say "we just need more data" as if it were an incantation. Sometimes it is. Often it isn't.

The honest rule is: more of the right data is magic; more of the wrong data is noise. If your dataset is biased, adding more biased examples will make the model more confident in its biased shape, not less. If your labels are wrong, adding more wrong labels entrenches the wrongness. If the feature that actually matters is still missing, adding a million more rows that also lack it changes nothing.
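A small simulation makes the first of those claims tangible. This is only a sketch under invented assumptions: a made-up world where half the customers are "heavy" users who churn 10% of the time and half are "light" users who churn 50% of the time, and a biased collection process that only ever samples the heavy users.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up world: two equally sized groups with different churn rates,
# so the true overall churn rate is about 30%.
CHURN_RATE = {"heavy": 0.10, "light": 0.50}
TRUE_RATE = 0.5 * CHURN_RATE["heavy"] + 0.5 * CHURN_RATE["light"]


def sample_churn(n, groups):
    """Draw n customers uniformly from `groups`; return the observed churn rate."""
    churned = 0
    for _ in range(n):
        group = rng.choice(groups)
        if rng.random() < CHURN_RATE[group]:
            churned += 1
    return churned / n


biased_small = sample_churn(100, ["heavy"])        # biased, little data
biased_large = sample_churn(100_000, ["heavy"])    # biased, lots of data
fair_small = sample_churn(1_000, ["heavy", "light"])  # representative, modest data

print(f"true rate:     {TRUE_RATE:.2f}")
print(f"biased, small: {biased_small:.2f}")
print(f"biased, large: {biased_large:.2f}")
print(f"fair, small:   {fair_small:.2f}")
```

The large biased sample settles very precisely on the wrong number, while the much smaller representative sample lands near the true one. More rows bought confidence, not correctness.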

So the question experienced practitioners ask is never just "can we get more data?" — it's:

  • More data of what? Examples we're missing? Edge cases we're weak on?
  • How will it be labelled? By whom? How consistently?
  • Where will it come from, and how does the sampling compare to the world we'll deploy into?
  • What features will we have on this new data that we don't have now?

Which is another way of saying: what matters is the pantry, not just the quantity. A chef with ten pounds of bad butter is not better off than a chef with one pound of good butter. She's actively worse off, because she'll probably waste time trying to make the bad butter work.

This is also why the most valuable thing at many ML companies isn't their model or their algorithm — it's the pipeline that produces their training data. Cleaning, labelling, validating, de-duplicating, monitoring for drift. It's unglamorous. It's most of the job. It's also where most of the wins come from.
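To give a flavour of what that unglamorous work looks like, here is a rough Python sketch of two such checks: dropping exact duplicates and setting aside examples whose labellers disagree. The emails, the labeller names, and the policy of holding back every conflicting example are all assumptions for illustration; a real pipeline would likely send conflicts for review rather than discard them.

```python
from collections import defaultdict

# Hypothetical raw labelled emails: (text, label, labeller).
raw = [
    ("Win a free cruise now", "spam", "intern_a"),
    ("Win a free cruise now", "spam", "intern_a"),           # exact duplicate
    ("Team lunch on Friday", "not spam", "intern_a"),
    ("Weekly newsletter: 20% off", "spam", "intern_a"),
    ("Weekly newsletter: 20% off", "not spam", "intern_b"),  # disagreement
]

# De-duplicate: dict keys keep first occurrences in their original order.
deduped = list(dict.fromkeys(raw))

# Collect every distinct label each text received.
labels_by_text = defaultdict(set)
for text, label, _labeller in deduped:
    labels_by_text[text].add(label)

# A text with more than one distinct label is a labelling conflict.
conflicts = sorted(text for text, labels in labels_by_text.items() if len(labels) > 1)

# Keep only examples whose label nobody disputed.
clean = [(text, label) for text, label, _labeller in deduped if text not in conflicts]

print(len(raw), len(deduped))  # 5 4
print(conflicts)               # ['Weekly newsletter: 20% off']
print(clean)
```

Nothing here is clever, which is the point: the checks are simple, and skipping them is how inconsistent labels end up teaching the model.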


The data diet decides the model

Here's the pattern that holds across everything we've seen in this module. Look at the three flavours of learning through the lens of data:

In supervised learning, the labels are the teacher — and the labels are data. Whoever wrote the labels is, in a very real sense, the real teacher of the model. If the labels encode a human bias, the model inherits that bias. If the labels were rushed, the model learns a rushed version of the task. The model is a mirror of its label-writers, at scale.

In unsupervised learning, there are no labels, but the features are everything. The embeddings the model builds — the map of how things relate — are a direct projection of what you chose to describe each thing by. Change the features, change the map. There is no "objective" unsupervised result; there's only the result of this pantry.

In reinforcement learning, the "data" is the stream of actions and rewards the agent sees. Whoever designed the reward function is, in effect, writing the labels — and the agent will pursue that reward with unsettling single-mindedness. Almost every story about RL going weird ("the boat-racing agent that learned to spin in circles collecting bonuses instead of finishing the race") is a story about a badly designed reward, which is a story about badly designed data.

The pattern across all three: the data you pick is not a neutral input. It is the single biggest design choice you make about what your model will become. The architecture is how you cook. The data is what you cook with. And we all know, in our bones, which one matters more.


The humbling takeaway

Here's the sentence that haunts every serious ML person, the one that eventually replaces the early excitement about clever algorithms:

Most of the job isn't building the model. Most of the job is deciding what to feed it.

This is strange, because it's not what the field looks like from the outside. From the outside, ML looks like heroic engineering on clever architectures. From the inside, ML mostly looks like: figuring out what we actually want to predict, tracking down the right data, cleaning it, labelling it, noticing it's biased, going back and fixing the bias, discovering a new feature that matters, getting more of the data we're weak on, retraining, finding that the old labels were wrong, relabelling, retraining again. The model itself is, often, a thirty-line call to a library. The pantry is the whole career.

You will read AI news forever with more confidence if you hold this in your head. When a system works well, the interesting question is usually "what did they feed it?" When a system fails embarrassingly, the interesting question is usually "what was missing from what they fed it?" The architecture headlines — new model, new trick, new million parameters — are real, but they rest on top of a mountain of data decisions that almost nobody talks about and almost everybody underestimates.


What just changed in your head — and what Module 2 just gave you

You started Module 2 with "learning" as a vague, brain-adjacent word. You're ending it with a much sharper picture.

  • M2.1 — learning = fitting a shape to examples.
  • M2.2 — supervised = every example comes pre-labelled.
  • M2.3 — unsupervised = no labels, but structure emerges anyway (clusters, similarities, embeddings).
  • M2.4 — reinforcement = learn what to do from rewards over time.
  • M2.5 — the one universal failure mode is overfitting: memorising instead of generalising.
  • M2.6 (this post) — the data diet decides what any of this can become.

Six posts, one mental model. When you read an AI headline now, you can sort it into one of the three flavours, recognise whether the claim is susceptible to overfitting, and ask the single most important question about any ML system: what did they feed it, and who decided?

That's a real superpower. It's also almost exactly what a good researcher actually does when they hear about a new AI result: sort the flavour, check the evaluation, ask about the data. The jargon that gets layered on top — loss functions, architectures, hyperparameters — is detail. The three questions underneath are the whole game.

In Module 3, we finally open the box most people mean when they say "AI" — neural networks. We'll see why they're called networks, why "neurons" is a lie, why stacking layers matters, and why this particular shape of model turned out to handle messy inputs better than everything that came before. The fit-a-shape metaphor you've just absorbed will carry all the way through. Only the shape is about to get a lot more interesting.


Course navigation

⬅️ Previous: M2.5 · Overfitting
📍 You are here: M2.6
Next ➡️: M3.1 · The Neuron Is a Lie

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
