RNNs — Reading One Word at a Time

Before transformers, this was the only serious way to handle sentences, speech, and anything else that arrives one thing at a time. RNNs are the detour that makes the main road make sense.


Try this. Read the following sentence out loud, one word at a time, and at each word make a note of what your best guess is for the word that comes next:

The chef carefully pulled the roasting pan out of the …

You didn't wait until the end of the sentence to form a guess. You started guessing right after "The." At "chef," your prediction sharpened: probably something food-related. At "roasting pan," it sharpened further: almost certainly "oven." By the time the sentence actually ends, you're barely even reading — you're confirming.

Notice what your brain is doing. At every word, you're holding a little bundle of state — a mental summary of the sentence so far — and using it to predict what comes next. You update the bundle with each new word, throw it forward to the next step, and repeat. You don't re-read the whole sentence every time a new word shows up. You just carry forward a running summary.

That's a recurrent neural network. RNN for short. Not scary. Not mysterious. Just a neural network with a notebook that it carries with it from one word to the next. In this post, we'll see exactly what that notebook looks like, why RNNs mattered, and why they eventually hit a wall that a completely different architecture had to climb over. No calculus. One notebook.


The problem CNNs couldn't solve

In the last post, convolutional networks cracked images open with a sliding flashlight. They're stunningly good at data where local patterns repeat everywhere. Fewer weights than the problem suggests, the same patterns searched across every position, layers of patterns building into objects. Beautiful.

Now try applying that idea to a sentence.

A sentence is a sequence of words. Words have an order. "The dog bit the man" is not the same as "The man bit the dog." A flashlight that slides through the sentence a few words at a time can see local patterns — pairs of words, short phrases — but it cannot easily connect "the" at the start with the verb at the end, because by the time the flashlight has slid that far, the earlier words are out of view. And more importantly, sequences have memory. Understanding the fifth word often depends on what the first word told you.

CNNs don't have a great notion of memory. They weren't built for it. For text, speech, music, time-series — anything that arrives one thing at a time and needs continuity — you need a different shape of network. One with a notebook.


A neural network that takes notes

Here's the whole RNN idea in one picture.

At every step, the RNN cell — which is really just a small neural network we've already met, some weighted sums and squishes — takes two inputs. The first is the current word. The second is the notebook from the previous step. It combines them, does its little committee thing, and produces two outputs: a new notebook to hand off to the next step, and maybe a prediction for what the current step should say.

The notebook has a technical name, hidden state, but you can forget that immediately. It's a notebook. A list of numbers the network uses to remember what's happened so far. At step 1, it typically starts as all zeros (blank paper). At step 2, it carries a summary of what step 1 saw. At step 10, it carries a summary of everything up to step 9. By design, the network never sees the original words again once they have passed; it sees only the evolving summary in the notebook.

Here's the honest one-line definition:

An RNN is a neural network that runs the same small network at every position in a sequence, carrying a notebook of state from one position to the next.

Three things to notice about this shape:

  1. The same weights are used at every step. Just like a CNN shares its filter across every image position, an RNN shares its little cell across every sequence position. That keeps the number of weights manageable no matter how long the sentence gets.
  2. There's a direction. Unlike a CNN, which can scan anywhere, an RNN is reading left to right (or right to left, or both, but always in some order). It knows what "before" and "after" mean.
  3. The notebook is the only memory. If something important happened at word 2 and the network needs it at word 50, the only way the information can get there is by surviving in the notebook across 48 update steps. This will turn out to be the critical weakness.
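If you'd like to see that shape in code (entirely optional), here is a minimal NumPy sketch. It assumes the simplest kind of cell, a single tanh layer, and the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh`, and `b_h` are made up for illustration. The weights are random rather than trained, so the numbers mean nothing; only the flow of the notebook matters.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step: mix the current input with the previous notebook, then squish."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Apply the same cell (same weights) at every position, in order."""
    h = np.zeros(W_hh.shape[0])          # blank notebook before the first word
    notebooks = []
    for x_t in inputs:                   # a fixed reading direction
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        notebooks.append(h)              # the notebook handed to the next step
    return np.stack(notebooks)

# Illustrative sizes and random (untrained) weights: assumptions, not real values.
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 6
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))   # stand-ins for six word vectors
notebooks = run_rnn(inputs, W_xh, W_hh, b_h)
print(notebooks.shape)  # one notebook per position: (6, 3)
```

Notice that the number of weights depends only on `input_size` and `hidden_size`, never on `seq_len`. Feeding the same inputs in reverse order would generally end with a different final notebook, which is the "direction" point in action.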

What RNNs were genuinely good at

From about 2014 to 2018, RNNs, especially a fancier version called the LSTM (long short-term memory), were the state of the art for almost every sequence problem.

Machine translation. The original Google Translate neural models were built on LSTMs. You'd feed an English sentence in, the RNN would build up a notebook summarising the whole thing, and then another RNN would use that notebook as its starting state and produce the French translation, one word at a time. For the first time, machine translation didn't sound like a drunk tourist; it sounded like a competent non-native speaker. That was a huge leap and it was RNNs that delivered it.

Speech recognition. Your voice assistant from that era was almost certainly running an RNN under the hood. Audio is a very long sequence of tiny samples. An RNN that reads samples one tiny chunk at a time, carrying a notebook, was the best practical approach to turning sound waves into words.

Text generation. Before large language models, if you wanted a neural network to produce text — autocomplete, write summaries, generate captions — you used an RNN. The trained network would take a starting word, produce a notebook, predict the next word, feed that prediction back in as the next input, update the notebook, predict another word, and so on. You could train an RNN on a year's worth of New York Times articles and get it to generate plausible-ish paragraphs. At the time, this felt like magic.
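For the curious, the generation loop described above can be sketched in a few lines of NumPy. Everything here is an assumption for illustration: a five-word toy vocabulary, random untrained weights, a plain tanh cell, and the rule "always pick the highest-scoring word." With untrained weights the output is gibberish; the point is only the feed-the-prediction-back-in loop.

```python
import numpy as np

# Toy vocabulary and random, untrained weights (illustrative assumptions).
vocab = ["the", "chef", "pulled", "pan", "oven"]
rng = np.random.default_rng(1)
V, H = len(vocab), 8
W_xh = rng.normal(scale=0.5, size=(H, V))   # word -> notebook
W_hh = rng.normal(scale=0.5, size=(H, H))   # old notebook -> new notebook
W_hy = rng.normal(scale=0.5, size=(V, H))   # notebook -> score for each word

def generate(start_word, n_words):
    h = np.zeros(H)                          # blank notebook
    word_id = vocab.index(start_word)
    out = [start_word]
    for _ in range(n_words):
        x = np.zeros(V)
        x[word_id] = 1.0                     # one-hot encoding of the current word
        h = np.tanh(W_xh @ x + W_hh @ h)     # update the notebook
        scores = W_hy @ h                    # score every word in the vocabulary
        word_id = int(np.argmax(scores))     # greedy pick of the next word
        out.append(vocab[word_id])           # the prediction becomes the next input
    return out

generated = generate("the", 5)
print(" ".join(generated))
```

Real systems usually sampled from the scores instead of always taking the top one, which keeps the text from looping; greedy picking is used here only because it is the shortest thing to write.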

Music and time series. Forecasting stock prices, generating drum beats, predicting what button a user would press next. Anywhere a thing was unfolding over time, RNNs were often the first honest approach.

For a few years, if you wanted a model that understood sequences, you reached for an RNN. It was the dominant architecture for that kind of data.


Where the notebook gets torn

Then, around 2017-2018, RNNs started getting quietly replaced for almost every serious problem. It wasn't that the architecture was wrong; it was that it had two specific, painful weaknesses that got harder to ignore as problems scaled up.

Weakness 1 · long-range memory is rickety. Remember, the only way information from word 2 can influence what the network does at word 200 is if it survives in the notebook across 198 update steps. In theory, a well-trained network could carefully preserve important information across long stretches. In practice, the notebook gets overwritten, smeared, and gradually washed out. Information from far in the past tends to fade unless the network has explicitly learned to protect it — and that learning is hard. The famous "vanishing gradient" problem is the technical name for this: the training signal from the far end of the sequence struggles to reach the weights that processed the early part, so those weights never get updated well. LSTMs were specifically designed to help with this, and they did help, but they didn't fully fix it. For sentences of a few dozen words, RNNs worked. For documents of a few thousand words, they strained. For genuinely long contexts, they mostly failed.
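Here is an optional numerical sketch of that fading. It runs the same tanh cell twice over identical inputs, starting from two slightly different notebooks, and measures how far apart the notebooks are at each step. One loud assumption: the recurrent weights `W_hh` are rescaled so their largest singular value is 0.9, which guarantees the gap shrinks by at least 10% per step. Trained networks are not required to behave this way, and this is the forward-direction cousin of the vanishing gradient rather than the gradient itself, but the same repeated multiplication is what weakens the training signal on the way back.

```python
import numpy as np

rng = np.random.default_rng(0)
H, steps = 16, 50

# Random recurrent weights, rescaled so the largest singular value is 0.9.
# This "shrinking" choice is an assumption made for the demonstration.
W_hh = rng.normal(size=(H, H))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)
W_xh = rng.normal(size=(H, H))               # input size set equal to H for simplicity
inputs = rng.normal(size=(steps, H))

def notebook_gaps(h0_a, h0_b):
    """Run two copies over the same inputs; return their distance at each step."""
    h_a, h_b, gaps = h0_a, h0_b, []
    for x in inputs:
        drive = W_xh @ x                     # identical input contribution for both
        h_a = np.tanh(W_hh @ h_a + drive)
        h_b = np.tanh(W_hh @ h_b + drive)
        gaps.append(np.linalg.norm(h_a - h_b))
    return gaps

delta = 0.1 * rng.normal(size=H)             # a small difference "written" at the start
gaps = notebook_gaps(np.zeros(H), delta)

for t in (0, 9, 49):
    print(f"step {t + 1}: gap = {gaps[t]:.6f}")
```

By step 50, whatever was different at the start has been almost completely washed out of the notebook, which is exactly the problem when the thing that was different was important.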

Weakness 2 · they cannot be parallelized across the sequence. Because the notebook at step 10 depends on the notebook at step 9, which depends on the notebook at step 8, and so on, you have to process each sequence one step at a time. You cannot compute step 10 and step 11 at the same time; you have to wait for 10 to finish before you can start 11. In an era where training speed was dominated by how well you could use GPUs, which are massively parallel machines, RNNs couldn't take advantage of most of the hardware. Training was slow and scaling was limited in a way CNNs never were.
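An optional sketch of the difference, under simple assumptions: `W` and `U` are random made-up weights, and the "parallel" computation is just a per-position layer with no notebook at all. When each position depends only on its own input, all positions can be computed in one matrix multiply. Once a notebook is involved, the loop cannot be removed, because each iteration needs the result of the one before it.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, D = 8, 4
X = rng.normal(size=(seq_len, D))            # one row per position in the sequence
W = rng.normal(scale=0.5, size=(D, D))       # input weights (illustrative)
U = rng.normal(scale=0.5, size=(D, D))       # notebook weights (illustrative)

# No notebook: every row is independent, so one matrix multiply handles
# all positions at once. This is the kind of work GPUs are good at.
parallel_out = np.tanh(X @ W.T)

# With a notebook: step t needs h from step t - 1, so the loop is unavoidable.
h = np.zeros(D)
steps_out = []
for x_t in X:
    h = np.tanh(W @ x_t + U @ h)             # depends on the previous iteration
    steps_out.append(h)
recurrent_out = np.stack(steps_out)

print(parallel_out.shape, recurrent_out.shape)
```

The two results agree only at the first position, where the notebook is still blank. From the second position on, the recurrent output depends on everything that came before, and that dependence is what forces the one-step-at-a-time schedule.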

Add those two weaknesses together and you get a ceiling. You can build pretty good RNN models for sentence-length problems, but you can't easily scale them to book-length problems without running out of money or overwhelming the notebook. For a while, the field accepted this and built fancier LSTMs and other variants. Then, in 2017, a paper called "Attention Is All You Need" came out and showed that the ceiling belonged to the architecture, not to the problem. That paper introduced the transformer, which tackled both weaknesses at once: long-range memory via an attention mechanism, and parallelism across the sequence by dropping the sequential notebook entirely. Within a couple of years, most serious new sequence models were transformers. A few years after that, RNNs were largely historical.

We'll spend all of Module 4 on the transformer. For now, you just need to know why the replacement happened.


Why RNNs still matter as a mental model

You might be wondering: if RNNs are dead, why this whole post?

Two reasons. First, the "notebook of state that updates at every step" idea is one of the cleanest ways to understand what "processing a sequence" means. Even though modern transformers do not work this way literally, the intuition of "read the sentence word by word, carry forward a running summary, use that summary to predict" is a fantastic mental model for what any sequence processor is trying to accomplish.

Second, RNNs clarify what transformers had to fix. When you see an attention mechanism in Module 4 and read that it lets every word in a sentence "look at" every other word directly, that's a big deal precisely because RNNs couldn't do it. Every word in an RNN had to route its information through the notebook, one step at a time, losing fidelity along the way. Attention is a direct response to that pain. Without knowing the pain, the solution looks like a clever party trick. With it, the solution looks like someone finally kicking a door down.

So RNNs are the detour that makes the main road make sense. You don't need to remember their training tricks or their gate equations. You need to remember the shape: a network with a notebook, reading sequences one step at a time, which was once the best we had and which is now a fossil.


What just changed in your head

You started this post knowing that neural networks handle images pretty well. You're ending it knowing how they used to handle sequences, and why that approach couldn't quite make it to the end. RNNs are the architecture that proved sequence learning was possible and simultaneously proved that the straightforward way to do it wouldn't scale. Every frustration RNNs caused became a requirement the transformer had to meet.

One sentence worth carrying forward:

An RNN reads a sequence one step at a time, carries a notebook of state from step to step, and is limited by how much the notebook can remember across distance.

That's the whole thing. That sentence is the context you need to appreciate what's coming in Module 4, which is nothing less than the single most consequential architecture change in AI history.

In the next post — the last post of this module — we take a step back from architectures and talk about a single idea that shows up inside all of them: the embedding. We've hinted at it twice already, in M2.3 and in our descriptions of neural network layers. Now we're going to meet it head on. It is, I will argue, the single most important idea in modern AI that isn't "neural network," and it shows up everywhere once you know the name.



📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
