The Transformer, Piece by Piece

Attention is the engine. The transformer is the whole vehicle. Here is what's inside one transformer block, in plain English, and why stacking dozens of them is the entire modern AI wave.


In the last post we met attention: every word in a sentence walking into a library, glancing at every other word, and copying the relevant content into its own notebook. That's the core move. But a transformer is not just attention. If attention is an engine, a transformer is the whole vehicle built around it — with a chassis, a cooling system, and a surprising amount of tape holding things together.

In this post we're going to open the hood. We'll look at what one transformer block actually contains, in plain English, and then we'll stack the blocks and see how the whole architecture turns "read a sentence" into "generate the next word." By the end, you'll be able to read any diagram of a transformer and roughly know what every box does. No calculus. No code. One block, then a stack.


What a transformer really is, in one sentence

Before we zoom into a single block, here's the elevator version of the whole thing:

A transformer is a stack of identical blocks, each containing an attention layer and a small feed-forward network, where the input is a list of word vectors and the output is a transformed list of word vectors that know more about each other than they did at the start.

That sentence is the whole architecture. We'll unpack every word of it. But hold it in your head — if you lose track of where we are in the diagram, come back to this sentence.

Two things before we dive in.

First, remember from M3.6 that every word is really a point in a high-dimensional space — an embedding. Inside a transformer, when we say "word," we almost always mean "the word's current embedding vector." The transformer's job is to keep updating those vectors, layer by layer, so that each vector captures more and more about the word's role in this specific sentence.

Second, the transformer doesn't see a sentence as a single chunk of text. It sees it as a list — the first word's vector, then the second word's vector, then the third. That list is what flows into the bottom of the transformer and out of the top. Everything in between is manipulation of that list.
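Optional, for readers who think in code: here is a minimal sketch of that "list of vectors" idea. The three words, the four-number vectors, and the names `EMBEDDINGS` and `embed` are all made up for illustration; a real model learns its vectors during training and uses hundreds or thousands of numbers per word.

```python
# Toy lookup table: each word maps to a small, made-up embedding vector.
# Real models learn these vectors and make them far longer.
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2, 0.0],
    "cat": [0.9, 0.1, 0.0, 0.3],
    "sat": [0.4, 0.8, 0.1, 0.0],
}


def embed(words):
    """Turn a list of words into a list of vectors, one per word, in order."""
    return [list(EMBEDDINGS[word]) for word in words]


sentence = ["the", "cat", "sat"]
vectors = embed(sentence)
print(len(vectors), "vectors of size", len(vectors[0]))  # 3 vectors of size 4
```

Everything the transformer does from here on is an update to that list: same number of vectors, same size, different numbers inside.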

Okay. The block.


Inside one transformer block

Here's what you'll find if you crack open one transformer block. Don't worry about the boxes yet — we'll go through them one at a time.

One block takes a list of vectors in, runs two main operations, and spits a transformed list out. Everything inside that block is designed to improve each vector's understanding of its own role in the sentence.

Let's walk through the block in order.

Step 1 · self-attention. Every vector in the input list runs the library trip from the last post. Every word asks its query, glances at every other word's key, gets relevance scores, and blends together the values of all the words, weighted by those scores, so the most relevant words contribute the most. After self-attention, every vector carries a little bit of the information it needed from the rest of the sentence. This is the step that makes the transformer context-aware.

Step 2 · residual add. Instead of replacing the original vector with the attention result, the block adds the attention result back into the original vector. It's more like "here's what you were before, plus a bit of what you just learned from the others." This little move — a residual connection — is borrowed from an earlier architecture called ResNet and it's one of the most load-bearing tricks in deep learning. It's the reason you can stack dozens of blocks without the signal getting lost. (More on why, in a second.)

Step 3 · feed-forward layer. This is the part people tend to forget about, but in the standard design it holds roughly two-thirds of a block's weights. After attention has shuffled information between words, each word's vector is passed through a small, private neural network: the same one for every word, but applied to each vector independently. No attention here. No looking at neighbours. Just each word getting one pass through a regular two-layer neural network that updates it. You can think of this step as each word digesting whatever it just learned from attention, and turning that raw information into something more useful.

Step 4 · another residual add. The output of the feed-forward is added back to its own input, same trick as before.

The result is a list of vectors that has been through one round of "look around, then think about what you saw." That's it. One transformer block. No other moving parts.
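Optional sketch, if you want to see the four steps as code. This is a toy under loud assumptions, not how any production model is written: one attention head, no layer normalization, no masking, tiny vectors, and random made-up weights where a real model would have trained ones. Every function and parameter name here (`transformer_block`, `wq`, `w1`, and so on) is chosen for illustration.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(a, b):
    """Element-wise sum of two vectors: this is the residual add."""
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors, wq, wk, wv):
    """Step 1: each word blends the values of all words, weighted by relevance."""
    queries = [matvec(wq, v) for v in vectors]
    keys = [matvec(wk, v) for v in vectors]
    values = [matvec(wv, v) for v in vectors]
    scale = math.sqrt(len(keys[0]))
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed.append([
            sum(w * value[i] for w, value in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return mixed


def feed_forward(vec, w1, w2):
    """Step 3: a two-layer network applied to one vector, no neighbours involved."""
    hidden = [max(0.0, h) for h in matvec(w1, vec)]  # ReLU non-linearity
    return matvec(w2, hidden)


def transformer_block(vectors, p):
    attended = self_attention(vectors, p["wq"], p["wk"], p["wv"])
    vectors = [add(v, a) for v, a in zip(vectors, attended)]         # Step 2
    digested = [feed_forward(v, p["w1"], p["w2"]) for v in vectors]
    return [add(v, d) for v, d in zip(vectors, digested)]            # Step 4


def random_matrix(rows, cols, rng):
    """Made-up weights; a trained model would have learned these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim = 4, 8
params = {
    "wq": random_matrix(dim, dim, rng),
    "wk": random_matrix(dim, dim, rng),
    "wv": random_matrix(dim, dim, rng),
    "w1": random_matrix(hidden_dim, dim, rng),
    "w2": random_matrix(dim, hidden_dim, rng),
}
vectors_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
vectors_out = transformer_block(vectors_in, params)
print(len(vectors_out), "vectors of size", len(vectors_out[0]))  # 3 vectors of size 4
```

Notice the shape of the thing: a list of three vectors goes in, a list of three vectors of the same size comes out. That matching shape is what lets you feed one block's output straight into the next block.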

Two details I'm glossing over on purpose: layer normalization (a stabilisation trick applied at each sub-step, before it in most modern models and after it in the original design) and multi-head attention (the multiple-librarians idea from the last post). These are important to get a transformer working, but they don't change the story of what a block is for. If someone shows you a more detailed diagram later with "LayerNorm" boxes everywhere, they're pointing at the stabilisation tape, not a new function.
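If you're curious what the stabilisation tape looks like, here is an optional, simplified sketch of layer normalization. It assumes the bare-bones version: real implementations also multiply by a learned scale and add a learned shift, which are left out here. The function name `layer_norm` and the example numbers are illustrative.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Rescale one vector so its numbers have mean 0 and variance close to 1."""
    mean = sum(vec) / len(vec)
    variance = sum((x - mean) ** 2 for x in vec) / len(vec)
    # eps keeps the division safe when all the numbers are nearly identical
    return [(x - mean) / math.sqrt(variance + eps) for x in vec]


wild = [120.0, -40.0, 3.0, 77.0]
tame = layer_norm(wild)
print([round(x, 2) for x in tame])
```

The vector keeps its pattern (which entries are big, which are small) but loses its wild scale, which is what keeps the numbers from drifting as they pass through many blocks.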

Here's the one-line definition of a block:

A transformer block is one round of "every word looks around and then thinks about what it found," wrapped in residual adds that let the signal survive many stacked blocks.


Why the residual trick is quietly the hero

Take a second on this one, because it's not obvious.

Imagine you're trying to train a 96-layer deep network where each layer substantially rewrites its input. Over 96 layers of rewriting, the original signal — the meaning of the words you fed in — can get smeared, distorted, or lost completely. Training struggles because the gradients have to flow backwards through 96 layers of transformation, and each layer is another opportunity for the signal to die.

The residual trick says: at each layer, don't rewrite the input. Add to it. The vector that leaves each block is (the vector that came in) plus (a correction computed from the attention and feed-forward). This means the original signal is still there, all the way up the stack, and the deeper layers are only ever computing small corrections on top of what the earlier layers produced. Gradients flow backwards through the sum cleanly because one of the two addends is just a straight copy.
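Here is an optional back-of-the-envelope sketch of that gradient argument, shrunk down to single numbers. The assumption is a toy one: each layer's correction f(x) has a small, made-up slope, and we compare the same 96 layers with and without the straight copy. Real networks work with matrices rather than single slopes, so treat this as a cartoon of the effect.

```python
import random

rng = random.Random(0)
num_layers = 96

# Hypothetical per-layer slopes: how strongly each layer's computed
# correction f(x) responds to a small nudge in its input.
slopes = [rng.uniform(-0.05, 0.05) for _ in range(num_layers)]

plain_gradient = 1.0     # every layer computes y = f(x)
residual_gradient = 1.0  # every layer computes y = x + f(x)
for slope in slopes:
    plain_gradient *= slope            # chain rule: only f's slope survives
    residual_gradient *= 1.0 + slope   # chain rule: the straight copy adds a 1

print(f"plain stack:    {abs(plain_gradient):.3e}")
print(f"residual stack: {residual_gradient:.3f}")
```

In the plain stack the backward signal is multiplied by a small number 96 times and all but vanishes. In the residual stack each factor sits close to 1, so the signal arrives at the bottom layers at a usable size.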

Without residuals, very deep stacks are much harder to train, and adding more layers stops paying off early. With residuals, people routinely train networks that are 96 layers deep, 120 layers deep, even more. Almost every modern deep network has some version of this trick in it, and the ones that don't tend to be much shallower. It's the difference between building a house out of mud (smears with every new layer) and building it out of Lego (each new brick clips onto the previous one without losing anything).

Hold onto this: residuals turn "adding a layer" from a gamble into a low-risk upgrade. If a new layer doesn't help, training can shrink its correction towards zero and the signal passes through almost untouched. If it does help, you get that help on top of everything the earlier layers built. The extra layer still costs compute, but it rarely costs quality. That asymmetry (downside small, upside big) is why stacking layers works at all in very deep networks.
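An optional sketch of the "unhelpful layer" case, taken to its extreme: 96 residual layers whose correction is exactly zero. The helper names `residual_layer` and `useless_correction` are invented for this example.

```python
def residual_layer(vec, correction_fn):
    """Output = the input, plus whatever correction the layer computes."""
    return [x + c for x, c in zip(vec, correction_fn(vec))]


def useless_correction(vec):
    """A layer that has learned nothing helpful: its correction is all zeros."""
    return [0.0 for _ in vec]


signal = [0.3, -1.2, 0.7]
out = signal
for _ in range(96):
    out = residual_layer(out, useless_correction)

print(out == signal)  # True
```

Ninety-six do-nothing layers, and the signal comes out exactly as it went in. A plain stack has no such fallback, because each layer's output replaces its input instead of adding to it.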


The whole transformer, stacked

Now let's zoom out. One block is a single round of look-and-think. A full transformer is just many of those blocks stacked on top of each other, anywhere from about six in early models to ninety-six or more in large ones. Here's the whole thing.

Four stages, in order:

Stage 1 · embed. The raw text is split into tokens (we'll see how in the next post), and every token is turned into a vector. These starting vectors don't know anything about context — they're just "here's the word 'bank', here's a generic vector for 'bank'." Whether this is the river-bank or the money-bank will get figured out later.

Stage 2 · stacked blocks. The list of vectors flows up through block after block. Each block lets every word look around at every other word and then digest what it found. Early blocks tend to pick up simple contextual tweaks — "this 'bank' is next to 'river', maybe it's the river kind." Middle blocks build more abstract features. Late blocks are often specialised for the task — "given everything this sentence means, what word comes next?" Nobody hand-designs this ladder of specialisation. It emerges from training, just like the abstraction ladder in M3.2.

Stage 3 · output head. After the final block, every word has a vector that's been thoroughly marinated in context. For most modern language models, the thing we actually want is a prediction for the next word. So a final layer takes the last word's vector and turns it into a probability for every word in the model's vocabulary. This final layer is called the output head or the prediction head.

Stage 4 · pick a word. You sample a word from that probability distribution, append it to the sentence, and run the whole transformer again — now with one more word in the input. That's how language models generate text: one token at a time, looping back through the whole stack for each new word.

There's a specific phrase for "predict the next word, one token at a time, feeding each prediction back as new input": autoregressive generation. Every chatbot you've used works this way under the hood. The model doesn't plan a whole response in advance. It just keeps writing one word at a time, conditioning each new word on everything before it.
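An optional sketch of that loop in code. The big assumption: `toy_logits` is a stand-in for stages 1 to 3 (embed, stacked blocks, output head), and its scoring rule is invented so the example stays short. The six-word `VOCAB` and the function names are illustrative too. The part to look at is the loop in `generate`.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(tokens):
    """Stand-in for the whole transformer: one raw score per vocabulary word.

    Made-up rule: favour the word that follows the last token in VOCAB order.
    """
    last = VOCAB.index(tokens[-1])
    favourite = (last + 1) % len(VOCAB)
    return [2.0 if i == favourite else 0.0 for i in range(len(VOCAB))]


def generate(prompt, steps, rng):
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens))                # stage 3: a probability per word
        word = rng.choices(VOCAB, weights=probs, k=1)[0]   # stage 4: sample one word
        tokens.append(word)                                # feed it back in as new input
    return tokens


story = generate(["the"], 5, random.Random(0))
print(" ".join(story))
```

Each pass through the loop reruns the model on the full sequence so far, picks one word, and appends it. Swap the toy scoring rule for a real stack of blocks and the loop itself stays the same.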


Encoders, decoders, and why modern LLMs are "decoder-only"

Brief sidebar. The original transformer paper actually described a pair of stacks — an encoder to read the input and a decoder to produce the output. That was because the original task was translation (read English, write French). The encoder would turn an English sentence into a stack of contextualised vectors, and the decoder would then produce French words one at a time, with its own attention layers that looked at both the encoder's output and the partial French translation so far.

For a few years, this encoder-decoder split was the standard for translation-style tasks. Google's early transformer translation models used it. The BERT family of text-understanding models went a different way and kept only the encoder half.

Then, around GPT-2 and GPT-3, OpenAI pushed a simpler idea: what if we just had a decoder, trained to predict the next word on a huge pile of text, and used that for everything? No separate encoder, no translation framing, just "predict the next word" at massive scale. It turned out you could use such a model for summarisation, translation, question answering, code generation, creative writing — anything — by just giving it the right prompt and letting it generate.

That's why most big language models you've heard of (GPT-4, Claude, Gemini, Llama) are, or are widely understood to be, decoder-only transformers. They're a stack of the block we just drew, trained on next-word prediction, used for everything. The architectural simplicity of that move is one of the most consequential decisions in AI history.

If you see someone say "encoder-decoder" or "decoder-only," that's what they mean. For the rest of this course, assume "transformer" means "decoder-only stack of blocks trained on next-word prediction" unless I say otherwise.


What just changed in your head

You started this post knowing about attention but not about what it sits inside. You're ending it with a picture of a block that does two things — look around, then think — wrapped in residual connections that let the block be stacked dozens of times without losing the signal. Stack the blocks, feed text in the bottom, read predictions off the top. That's the whole architecture that runs modern AI.

One sentence to carry forward:

A transformer is a tall stack of identical blocks, each one letting every word look at every other word and then updating itself, with residual connections that make the stack deep-trainable.

That's your whole working model of a transformer. When we talk about "a 70-billion parameter model" or "a 96-layer transformer" or "the model attended to the earlier context," you can now picture what's actually going on.

In the next post, we handle two words that come up constantly in AI conversations and almost never get properly explained: tokens and the context window. They're simple ideas, but they have big implications, including why models sometimes forget what you said three messages ago, and why longer context windows get expensive fast (the attention work grows roughly with the square of the length). No calculus. No code. One scroll.


Course navigation

⬅️ Previous: M4.1 · Attention, Without Equations
📍 You are here: M4.2
Next ➡️: M4.3 · Tokens and Context Windows

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
