Embeddings — Turning Everything into a Point in Space
Pick the one idea that shows up inside every modern AI system — LLMs, recommenders, search, vision. It's this one. Words are points. Images are points. You are a point. Here is why that matters.
Here is the weirdest thing modern AI does, and nobody talks about it enough.
If you ask an AI system about the word "king," somewhere deep in its guts, the word isn't stored as a word at all. It's stored as a list of numbers. A long list — maybe 768 numbers, maybe 4,096 numbers — that together describe a point in a space. The word "queen" is stored as a different point, nearby. "Banana" is a point, far away. Your face, if a face-unlock system knows it, is a point. A song you liked on a music app is a point. The last email you got is a point. Every interesting thing inside any modern AI system is, somewhere, a point in a space you can't easily draw.
Those points are called embeddings, and they are — I'll defend this — the most important single idea in modern AI that isn't "neural network."
We've brushed against embeddings twice already in this course: once in M2.3, when we talked about party guests becoming points, and once in the abstraction ladder in M3.2, when the middle layers of a network were quietly building numerical descriptions of things. In this post, the last of Module 3, we meet embeddings head-on. Why they work. Why they keep showing up. And why, once you know what an embedding is, you start seeing them everywhere. No calculus. No code. One map.
The trick: turn things into coordinates
Imagine you're cataloguing every fruit in a grocery store. You want a system where similar fruits are easy to compare — where "apple" and "pear" feel close, "apple" and "mango" feel farther, and "apple" and "onion" are basically on different continents.
One way to do this is to describe every fruit along a few axes. Pick three dimensions:
- Sour → sweet (a number from 0 to 10)
- Soft → crunchy (a number from 0 to 10)
- Round → long (a number from 0 to 10)
Now every fruit is just a triple of numbers. An apple might be (7, 8, 3). A pear is (7, 6, 3). A mango is (9, 4, 4). An onion is (3, 8, 3). Plot those points in three-dimensional space and you get a scatter where apples and pears are nearly on top of each other, mangoes are off to one side in "sweet squishy" territory, and onions are in their own weird "not sweet, crunchy" corner.
That's the whole idea of an embedding. An embedding is a way of taking messy real things — words, images, sounds, users, sentences, faces — and assigning each one a point in a space, such that similar things end up near each other. Distance in the space means "how similar these things are." Direction in the space sometimes means something meaningful too, though we'll get to that.
The honest one-line definition:
An embedding is a list of numbers that locates a thing in a space where nearby means similar.
That sentence is short but load-bearing. Every word in it matters.
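An optional aside for readers who'd like to see the fruit map run. This is a minimal sketch, assuming plain straight-line distance as the measure of "near"; the coordinates are the ones from the text, and the names `FRUITS`, `distance` and `ranked_by_distance` are made up for illustration.

```python
import math

# Coordinates copied from the text: (sweetness, crunchiness, length), each 0-10.
# Higher first number = sweeter, as the fruit descriptions imply.
FRUITS = {
    "apple": (7, 8, 3),
    "pear": (7, 6, 3),
    "mango": (9, 4, 4),
    "onion": (3, 8, 3),
}


def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)


def ranked_by_distance(name, items):
    """Every other item, sorted from nearest to farthest from `name`."""
    origin = items[name]
    others = [
        (other, distance(origin, point))
        for other, point in items.items()
        if other != name
    ]
    return sorted(others, key=lambda pair: pair[1])


for other, d in ranked_by_distance("apple", FRUITS):
    print(f"apple -> {other}: {d:.2f}")
```

With these numbers the pear comes out nearest to the apple (distance 2.00). The onion (4.00) actually lands a little nearer than the mango (about 4.58), a small reminder of how crude three hand-picked axes are.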
Why this is so powerful
The grocery-store example has three dimensions and a couple of fruits. That's cute. The reason embeddings matter is that the same trick scales, ferociously.
Modern language embeddings don't use three axes. They use hundreds, or thousands. Human hands can't draw a point in 768-dimensional space, but a computer has no problem at all holding "here's a 768-long list for every word I've ever seen." And — crucially — the computer never had to hand-pick the axes. Training a neural network on a huge pile of text produces axes automatically. Not by magic: by the same downhill walk we learned about in M3.3. The network is trained on some task — say, predicting the next word — and in the course of learning to do that task, it discovers that giving "king" and "queen" similar coordinates (with a small difference along some hidden axis) makes predictions work better. So the training process nudges the coordinates that way. Over time, the whole vocabulary settles into a rich, invisible map of meaning, and nobody ever explicitly said "make words related to royalty cluster together."
Once you have a good embedding space, an enormous number of tasks become almost trivial:
Similarity search. Want to find products like this one? Compute its embedding, compare it with the embeddings of every other product (usually computed ahead of time), and take the five nearest points. Done. This is the core idea behind "related videos" rails on streaming platforms, "people you may know" on social networks, and "find more of this" buttons. Very little bespoke logic: mostly just distance in an embedding space.
Clustering. Want to group thousands of support tickets into themes automatically? Embed each ticket as a point, run a clustering algorithm, name the clusters. The network never needed to see the clusters during training; the embedding just turns out to be useful for that task too.
Semantic search and retrieval. Modern search is no longer just keyword matching. It embeds your query as a point and retrieves documents whose embeddings are close to that point. Ask "how do I fix a flat tyre" and the search can find a document titled "patching a punctured wheel" even though the two share no meaningful keywords. Meaning matched instead, because the two phrases embed to nearby points.
Recommendation. A streaming service like Netflix can lean less on the genres of films than you might think. What matters more is where each film and each user sit in a shared embedding space. Recommend the films nearest to you. Retrain the embeddings as users interact. Beautifully simple and absurdly effective.
Anomaly detection. Normal transactions cluster in one region of the embedding space. A fraudulent transaction sits somewhere weird. Flag the weird ones. The model didn't need labelled fraud to learn the shape of normal.
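Another optional aside. One very simple way to turn "sits somewhere weird" into a rule is to measure how far a point is from the centre of the normal cluster. The sketch below assumes exactly that; the 2-D "transaction embeddings", the `slack` factor and all the function names are invented for illustration, and real fraud systems are considerably more involved.

```python
import math

# Made-up 2-D "transaction embeddings"; real ones would have many more dimensions.
NORMAL = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.9), (1.2, 1.1), (0.8, 0.8)]


def centroid(points):
    """Average position of a set of points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


def fit_threshold(points, slack=1.5):
    """Return the centre of the normal region and a cut-off radius.

    The radius is `slack` times the distance of the farthest normal point,
    which is an arbitrary choice made for this sketch.
    """
    centre = centroid(points)
    radius = max(math.dist(p, centre) for p in points)
    return centre, slack * radius


def is_anomaly(point, centre, threshold):
    """True when the point lies outside the cut-off radius."""
    return math.dist(point, centre) > threshold


CENTRE, THRESHOLD = fit_threshold(NORMAL)
print(is_anomaly((1.0, 1.1), CENTRE, THRESHOLD))   # a point inside the cluster
print(is_anomaly((4.0, -2.0), CENTRE, THRESHOLD))  # a point far from it
```

Notice that nothing here was ever shown a labelled example of fraud. The rule only knows the shape of normal.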
In every one of these cases, the hard work has been quietly outsourced to the embedding. Once the embedding exists, the downstream tasks are easy. That's why so many modern AI systems spend their best weights and most of their training compute on building a good embedding, and then treat the end task almost as an afterthought.
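To make "the downstream task is easy" concrete, here is an optional sketch of the similarity-search idea described above. It assumes cosine similarity (how closely two points line up in direction) as the measure of closeness, which is a common choice but not the only one. The product names, their 3-D vectors and the helper names are all hypothetical.

```python
import math

# Hypothetical 3-D product embeddings, invented for illustration.
CATALOGUE = {
    "running shoes": (0.9, 0.1, 0.2),
    "trail shoes": (0.8, 0.2, 0.3),
    "hiking boots": (0.7, 0.3, 0.5),
    "espresso machine": (0.1, 0.9, 0.1),
    "coffee grinder": (0.2, 0.8, 0.2),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, catalogue, k=2):
    """Names of the k catalogue entries most similar to the query vector."""
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in catalogue.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# "Find more like this": search everything except the item itself.
query = CATALOGUE["running shoes"]
others = {n: v for n, v in CATALOGUE.items() if n != "running shoes"}
print(most_similar(query, others, k=2))
```

This version compares the query against every item, which is fine for five products. Systems with millions of items typically swap the loop for an approximate nearest-neighbour index, but the idea stays the same.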
The move that made people gasp
Here's the demonstration, circa 2013, that woke the world up to what embeddings could do.
Researchers at Google trained word embeddings on a huge pile of text. Each word became a point in roughly 300-dimensional space. Then they noticed something eerie. You could do arithmetic on the points.
Take the point for "king." Subtract the point for "man." Add the point for "woman." Ask which word's point is nearest to the result, setting aside the three words you started with. "Queen."
Think about that for a second. Nobody told the network what "king" or "queen" or "man" or "woman" meant. Nobody defined gender or royalty as axes. The network was only taught to predict surrounding words in sentences from a huge corpus of news text. And yet, in the embedding space that fell out of training, "king minus man plus woman" landed approximately on "queen," because the direction from "man" to "woman" and the direction from "king" to "queen" turned out to be approximately the same direction: an invisible "gender axis" the network had discovered because it helped with prediction.
It worked for "Paris minus France plus Italy equals Rome." It worked for "walking minus walk plus swim equals swimming." It worked for dozens of other little analogies. The embedding wasn't just a lookup — it was a structured space where directions meant things.
The researchers didn't engineer any of this. It fell out of the training process. They discovered it. That moment, usually filed under the name word2vec (the model those researchers released), is when a lot of people in the field realised that embeddings were not just a technical convenience. They were doing something closer to representing meaning itself.
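For the curious, the king/queen arithmetic can be sketched in a few lines. A loud caveat: the vectors below are hand-made 2-D toys with axes I chose (royalty, femininity) so that the analogy works. They are not real word2vec output, where the structure is learned from text rather than typed in. The `analogy` helper is hypothetical, and it uses straight-line distance for simplicity, whereas the original word2vec tooling compared vectors by cosine similarity.

```python
import math

# Hand-made 2-D vectors: (royalty, femininity). NOT real word2vec output.
VECTORS = {
    "man": (0.1, 0.1),
    "woman": (0.1, 0.9),
    "king": (0.9, 0.1),
    "queen": (0.85, 0.95),
    "prince": (0.7, 0.15),
    "banana": (0.05, 0.4),
}


def analogy(a, b, c, vectors):
    """Word whose vector is nearest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the search.
    """
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: math.dist(candidates[w], target))


print(analogy("king", "man", "woman", VECTORS))
```

The exclusion step matters. Without it, the nearest point to "king minus man plus woman" is often just one of the words you started with.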
Where embeddings live inside a neural network
Here's a subtle but important point: a neural network isn't a thing that uses embeddings; it's a thing that's basically made of them.
Remember the abstraction ladder from M3.2. Each layer of a network transforms its input into a new representation — a new list of numbers — that summarises the input at a slightly higher level of abstraction. At the bottom of the ladder, the input might be raw pixels or one-hot-encoded word IDs. By the middle of the ladder, each thing has become a rich numerical description: the essence of this image, the essence of this word in this sentence. By the top of the ladder, the description is tuned to whatever task the network was trained for.
Every one of those middle representations is, technically, an embedding. The deeper in the network you go, the more abstract the embedding. When people say "the model has a good internal representation of the world," what they mean is "its internal embeddings cluster meaningfully — things that should be similar end up near each other, things that should be different end up far apart, and useful directions fall out of the geometry."
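An optional sketch of the bottom rungs of that ladder. It assumes a three-word vocabulary, a made-up embedding table and a made-up layer of weights (nothing here is trained), and shows two things: multiplying a one-hot word ID by the table simply picks out that word's row, and the next layer turns that list of numbers into another list of numbers. All names are hypothetical.

```python
# A made-up vocabulary and embedding table: one row of numbers per word.
VOCAB = ["king", "queen", "banana"]
EMBEDDING_TABLE = [
    [0.9, 0.1],    # king
    [0.85, 0.95],  # queen
    [0.05, 0.4],   # banana
]

# Made-up weights for one further layer: 2 numbers in, 3 numbers out.
LAYER_WEIGHTS = [
    [1.0, -1.0, 0.5],
    [0.5, 1.0, -1.0],
]


def one_hot(word):
    """A list of zeros with a single 1.0 at the word's position."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def row_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix (given as a list of rows)."""
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]


def embed(word):
    """One-hot times the table: equivalent to looking up the word's row."""
    return row_times_matrix(one_hot(word), EMBEDDING_TABLE)


def next_layer(representation):
    """One layer: weighted sums followed by ReLU (negatives become 0)."""
    return [max(0.0, v) for v in row_times_matrix(representation, LAYER_WEIGHTS)]


first = embed("queen")      # the word as a point
second = next_layer(first)  # the same word, re-described one rung up
print(first)
print(second)
```

Both `first` and `second` are embeddings in the sense used here: lists of numbers describing the same thing at different levels of the ladder.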
When you hear talk about LLMs "understanding" things, the closest honest translation is: the LLM has embeddings where semantically similar inputs end up in similar places. It has a map. The map is enormous, with thousands of dimensions, but it's still a map. Every impressive thing the model does, it does by navigating that map.
Embeddings vs words, pictures, songs — they are all just points now
The final thing I want you to notice is the interchangeability.
Until you really absorb the embedding idea, pictures feel different from words, which feel different from songs, which feel different from user profiles. They're different types of data, with different rules and different tools. Embeddings make that distinction dissolve. A picture is a point. A word is a point. A song is a point. A user is a point. Once you can embed them all in the same space, you can ask questions like "which picture is closest to this word?" and get a sensible answer. That's the intuition behind image captioning ("this picture is closest to these words"). It's the intuition behind text-to-image generation ("produce an image whose embedding is close to the embedding of this prompt"). And it's how multimodal search works: you type a sentence and it retrieves a video whose embedding is nearby.
The universal embedding space is the single idea that turned AI from a collection of specialised tricks into a single, roughly unified technology. Every modality — text, image, audio, video, structured data — can be embedded into the same kind of space, and once they're there, the same simple operations (nearest, similarity, arithmetic) work on all of them. That is, quietly, the biggest story of modern AI, and it's usually not even the headline.
What just changed in your head, and the shape of Module 4
You started this post with "embedding" as a slightly technical word. You're ending it with a mental image of an enormous invisible map — with points for every word, every image, every song, every user, every concept your AI system cares about — and the understanding that most of what a neural network does is build and navigate that map.
A sentence worth carrying into every remaining post of this course:
Modern AI works because it turns everything — words, pictures, sounds, users, intentions — into points in a space where nearness means similarity and directions mean relationships.
Hold onto that. It's about to explain half of what happens in Module 4, when we meet the transformer, the architecture behind nearly every modern large language model. Transformers are, at their heart, industrial-strength embedding builders. They take a sequence of words, let every word look at every other word through a mechanism called attention, and transform the whole sequence into a stack of richer and richer embeddings, layer by layer. What RNNs struggled to do (cleanly relate distant parts of a sequence) transformers handle as their main move.
We've now covered everything we need to make that story make sense. The neuron, the ladder, the downhill walk, CNNs for images, RNNs for sequences, and embeddings everywhere. Module 3 is done. The machinery is all in your head.
Module 4 is going to take that machinery and show you how all of it got stitched into the single architecture that powers ChatGPT, Claude, Gemini, and everything else in the current wave. No calculus. No code. Just one more metaphor at a time.
See you on the main road.
Course navigation
| ⬅️ Previous | 📍 You are here | Next ➡️ |
| --- | --- | --- |
| M3.5 · RNNs — One Word at a Time | M3.6 | M4.1 · Attention, Without Equations |
📚 AI Zero to Hero · Course Home — all 33 posts, six modules.
Cover photo via Unsplash. This post is part of the AI Zero to Hero series.