Tokens, Context Windows, and Why Length Matters

Large language models do not read words. They read tokens — chunks of text that are often weirder than you'd expect. And the length of what they can hold in view at once is the hidden cost behind every modern AI bill.

Two of the words you'll hear most often in any AI conversation — and two of the words people most confidently misuse — are token and context window. "GPT-4 has a 128k context window." "Claude supports 200k tokens." "This prompt will cost you a thousand tokens." Everyone nods, and roughly nobody knows what any of those sentences actually means.

The underlying ideas are simple, both of them, and you'll have working intuitions for each by the end of this post. But they come with a pile of small weirdnesses that are worth knowing about, because those weirdnesses explain a lot of otherwise-mysterious LLM behaviour, including why the model forgot what you said three messages ago and why your API bill doubles when you paste in a long document.

No calculus. One scroll and one paper ruler.


Why models read in chunks, not words

When you type a sentence into a language model, the model does not read it word by word. The first thing that happens is that your text gets chopped into little pieces called tokens. Each token is then looked up in a big table of learned embeddings, and the resulting vectors are fed into the transformer we met in the last post.

The obvious question: why chop? Why not just use whole words?

Two reasons, both practical.

Reason 1 · the vocabulary would be huge. If every distinct word got its own token, the model's vocabulary would have to include every word in every language, every proper noun, every misspelling, every slang coinage, every brand name. That's many millions of entries. Every one would need its own embedding vector to learn, and rare words would barely get enough training examples to learn anything useful about them.

Reason 2 · whole-word tokens miss relationships between related words. "Run," "running," "runs," "runner" all share a meaning but are different words. With word-level tokens, the model has to learn each one from scratch. With chunked tokens, a shared chunk like "run" can show up inside all of them, and whatever the model learns about "run" transfers naturally to "runner."

The compromise is subword tokenization. Instead of words, the model's vocabulary is a carefully chosen set of chunks (typically somewhere between tens of thousands and a few hundred thousand of them), where common English words get their own token, less-common words get broken into a few sub-pieces, and truly weird inputs get broken down further still. The chunks are learned from a huge corpus of text using a clever algorithm called byte-pair encoding (BPE) or one of its cousins. You do not need to know how BPE works. You need to know what it produces.

Here's what tokenization actually looks like on a handful of real examples.

A common English word like "cat" is usually one token. A long but common word like "internationalization" might be two tokens — roughly "international" plus "ization." A nonsense word you made up, like "llmz," gets chopped into smaller pieces the tokenizer can find in its vocabulary. An emoji, which isn't any kind of word, gets represented at the byte level — sometimes three tokens for a single symbol.
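If you'd like to watch the chopping happen, here is a minimal sketch in Python. It is not how any real tokenizer works internally: the vocabulary `TOY_VOCAB` is made up for this example, the greedy longest-match rule is a simplification standing in for learned BPE merges, and the `<0x..>` notation for byte-level pieces is invented for readability.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# Real tokenizers learn their vocabulary (for example with BPE) and differ in detail.
TOY_VOCAB = {"cat", "international", "ization", "ll", "m", "z"}  # hypothetical


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary chunks, left to right.

    A character that matches nothing falls back to its UTF-8 bytes,
    written here as strings like '<0xE2>'.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No chunk matched: emit one token per byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


print(toy_tokenize("cat"))                   # ['cat']
print(toy_tokenize("internationalization"))  # ['international', 'ization']
print(toy_tokenize("llmz"))                  # ['ll', 'm', 'z']
print(toy_tokenize("\u2728"))                # sparkles emoji: ['<0xE2>', '<0x9C>', '<0xA8>']
```

The last line is the emoji case: the symbol is a single character, but with nothing in the vocabulary to match it, it costs one piece per byte.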

Two consequences of this are worth keeping in your head:

  1. Token count does not equal word count. For English, a rough rule of thumb is "1 token is about 3/4 of a word." So 1,000 tokens is roughly 750 English words, or a page and a half of a book. For code, the ratio is worse (code tokenizes less efficiently). For non-English languages, sometimes dramatically worse.

  2. The same text, tokenized by different models, produces different token counts. Every model has its own tokenizer. GPT-4's tokenizer, Claude's tokenizer, and Llama's tokenizer all chop text into different pieces. This matters for bills (you pay per token, and token counts differ between vendors) and for behaviour (a model trained with one tokenizer has slightly different strengths than one trained with another).
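Both consequences can be seen with a few lines of Python. None of the three schemes below is a real tokenizer; they are deliberately crude stand-ins (whole words, single characters, fixed four-character chunks) chosen only to show that the "length" of a text depends on who is doing the chopping.

```python
# Three made-up chopping schemes applied to the same sentence.
def word_scheme(text):
    return text.split()


def char_scheme(text):
    return list(text)


def chunk_scheme(text, size=4):
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Tokenizers disagree about length."
counts = {
    "words": len(word_scheme(text)),
    "chars": len(char_scheme(text)),
    "chunks": len(chunk_scheme(text)),
}
print(counts)  # {'words': 4, 'chars': 33, 'chunks': 9}
```

Same sentence, three different counts. Real tokenizers disagree less dramatically than this, but they do disagree.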

The honest one-liner:

A token is a chunk of text, usually between one character and one word, that the model treats as a single atom. The model never sees raw text — only a list of token IDs.

Good. That's half the post. On to the context window.


The context window is a paper ruler

Now the second concept. When the model runs, it loads all the tokens of your current input — the system prompt, the conversation history, your latest question, any attached documents — onto a kind of scroll, and then the transformer reads that entire scroll to generate the next token. The maximum length of the scroll is called the context window, and it's always measured in tokens, not words.

Everything in that list competes for space on the same scroll. If the scroll is 8,000 tokens long and the system prompt is 1,500 tokens and the conversation history is 4,000 tokens and the document you pasted is 3,000 tokens, you are already 500 tokens over budget before the model has said a word, and the application has to throw something out. Usually the earliest conversation turns. That's why, in long chats, the model eventually "forgets" what you said an hour ago — it was quite literally pushed off the end of the scroll.
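The over-budget arithmetic above can be sketched in a few lines. This is an illustration of one simple policy (drop the oldest turns first), not a description of what any particular application does; the function name `fit_history` and its parameters are hypothetical.

```python
def fit_history(window, system, document, turns, answer_reserve=0):
    """Drop the oldest conversation turns until everything fits in the window.

    All arguments are token counts; `turns` is ordered oldest first.
    Returns the token counts of the turns that survive.
    """
    fixed = system + document + answer_reserve
    kept = list(turns)
    while kept and fixed + sum(kept) > window:
        kept.pop(0)  # the earliest turn falls off the end of the scroll
    return kept


# The example from the text: 1,500 + 4,000 + 3,000 = 8,500 tokens in an 8,000 window.
turns = [1000, 1000, 1000, 1000]  # four turns totalling 4,000 tokens
print(fit_history(8000, 1500, 3000, turns))  # [1000, 1000, 1000]
```

Passing a non-zero `answer_reserve` makes the squeeze tighter, since the model's reply also has to fit on the scroll.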

The context window is also where the "Claude has a 200k context window" headlines come from. That number is the size of the scroll. A 200k-token window is roughly a 300-page book you can hand to the model and ask questions about. An 8k-token window is a long blog post. A 2k-token window, common in older models, is about three pages.

Two more things to know about context windows.

1 · Bigger windows are disproportionately expensive. Recall that self-attention lets every token look at every other token. With 1,000 tokens, every token has 999 comparisons to make, about a million in total. With 10,000 tokens, every token has 9,999 comparisons, about 100 million in total. So the cost of attention scales with the square of the window length: doubling the window quadruples the attention compute. This is why 200k-token context windows are a technological achievement, not just a configuration setting, and why using them is noticeably slower and more expensive than short prompts. A lot of active research is about making attention cheaper at long contexts (sliding windows, sparse attention, linear attention, and more), but the default quadratic cost is the thing most long-context products are still quietly paying for.
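The comparison counts are easy to check for yourself. The sketch below counts only "each token looks at every other token once"; it ignores everything else a transformer does, so treat it as the shape of the cost, not a real compute estimate.

```python
def attention_comparisons(n_tokens):
    """Pairs counted when each token looks at every other token once."""
    return n_tokens * (n_tokens - 1)


print(attention_comparisons(1_000))   # 999000, about a million
print(attention_comparisons(10_000))  # 99990000, about 100 million

# Doubling the window roughly quadruples the count.
ratio = attention_comparisons(2_000) / attention_comparisons(1_000)
print(round(ratio, 2))  # 4.0
```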

2 · Bigger windows do not automatically mean better recall. A model with a 200k context window can technically read all 200k tokens, but in practice, models tend to remember what's at the very start and the very end of the scroll much better than what's in the middle. This is called the lost-in-the-middle effect, and it's why stuffing a huge document into a prompt and asking a question sometimes works brilliantly and sometimes produces an answer that confidently ignores a key sentence on page 47. Think of it like listening to someone read a long email out loud — you remember the opening and the ask, and the middle is a blur. LLMs are surprisingly human about this.

The one-liner for context window:

A context window is the fixed-length scroll of tokens the model can see at once — including the prompt, the conversation, attached files, and the answer-in-progress. Anything off the scroll might as well not exist.


Why this explains so much LLM weirdness

Once you have tokens and context windows in your head, a lot of AI behaviour that previously looked arbitrary starts making sense.

"Why did the model forget what I said at the start of our conversation?" Because you have been chatting long enough that the earliest turns rolled off the end of the scroll. Your words didn't get stored in a database somewhere — they were in the scroll, and the scroll has finite length.

"Why is this API call twice as expensive as the previous one?" Probably because you attached a long document and the prompt is now twice as many tokens. You're being billed per token on both the input and the output.

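The billing arithmetic is a one-liner. The prices below are hypothetical placeholders, not any vendor's actual rates; the point is only that cost is a sum of two per-token terms, so a prompt with twice the input tokens costs twice as much on the input side.

```python
def call_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Dollar cost of one call when input and output are billed per token."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
short_prompt = call_cost(10_000, 0, 3.0, 15.0)
long_prompt = call_cost(20_000, 0, 3.0, 15.0)
print(short_prompt, long_prompt)  # 0.03 0.06
```
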
"Why does the model sometimes count characters wrong?" Because the model does not see characters. It sees tokens. When you ask "how many letters are in 'strawberry'?", the model has to work indirectly through token boundaries that don't cleanly map to letters. The famous "how many r's are in strawberry" confusion is almost entirely a tokenization artefact.

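Here is the letter-counting problem in miniature. The split of "strawberry" into three pieces and the ID numbers are invented for illustration; a real tokenizer may split the word differently.

```python
# Hypothetical split and IDs, chosen for illustration; real tokenizers differ.
TOY_IDS = {"str": 1042, "aw": 675, "berry": 15717}

word_pieces = ["str", "aw", "berry"]
model_input = [TOY_IDS[piece] for piece in word_pieces]
print(model_input)  # [1042, 675, 15717]  <- what the model receives

# Counting letters is trivial when you can see characters...
letters_view = "".join(word_pieces).count("r")
print(letters_view)  # 3
# ...but nothing in [1042, 675, 15717] says which IDs contain an "r", or how many.
```
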
"Why does the model repeat itself after enough time?" Because the scroll only has room for so much recent context. Past a point, the model loses track of which things it already said and circles back.

"Why do long documents sometimes get ignored in the middle?" Lost-in-the-middle. The model's attention, despite allowing every token to look at every other, tends to emphasise the ends of the scroll at the expense of the middle. Nobody fully understands why. Research is ongoing.

"Why are some languages so much more expensive?" Because tokenizers trained primarily on English split text in other languages and scripts into many more pieces. A Japanese sentence might use 3x as many tokens as an English sentence with the same meaning, costing the user 3x more.

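One way to get a feel for the language gap is to count UTF-8 bytes. This is only a proxy: it shows the worst case for a tokenizer that falls all the way back to bytes, and real tokenizers merge common byte sequences, so actual token counts will differ from model to model.

```python
def utf8_bytes(text):
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


english = "hello"
japanese = "こんにちは"  # "hello" in Japanese, five characters

print(len(english), utf8_bytes(english))    # 5 5
print(len(japanese), utf8_bytes(japanese))  # 5 15
```

Same number of characters, three times the bytes. A tokenizer with few Japanese chunks in its vocabulary has much more raw material to chop up.
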
None of these are bugs. They're all direct consequences of the architecture we just built up. Once you see the scroll, you see why.


A practical mental model for using LLMs

Two quick habits that fall out of knowing this material.

Budget tokens, not words. When you're writing a system prompt for an application, ballpark-estimate it in tokens (about 1.3 tokens per word for English). Then leave real budget for conversation history and the answer. If you're inside an 8k window and your system prompt is already 5k tokens, you're going to get a forgetful, slow assistant.
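A back-of-the-envelope version of this habit, as a sketch. The 1.3 figure is the rough English ratio from above, and `remaining_budget` is a hypothetical helper, not part of any library; for real numbers you would count with the tokenizer of the model you actually use.

```python
TOKENS_PER_WORD = 1.3  # rough figure for English prose, an approximation


def remaining_budget(window, system_prompt_words, answer_reserve):
    """Tokens left for conversation history after the system prompt and answer reserve."""
    system_tokens = round(system_prompt_words * TOKENS_PER_WORD)
    return window - system_tokens - answer_reserve


# A lean system prompt in an 8k window leaves plenty of room for history.
print(remaining_budget(8000, 1000, 1000))  # 5700

# A system prompt of about 3,846 words is roughly 5k tokens, which leaves very little.
print(remaining_budget(8000, 3846, 1000))  # 2000
```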

Put the most important thing early or late. If you're pasting a long document in and asking a question about it, put your question before and after the document. Models attend best to the edges. "Question. [Very long document]. Remember, the question is: ..." is a surprisingly effective pattern for beating lost-in-the-middle.
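The pattern is simple enough to write as a small helper. The function name and the exact wording of the labels are illustrative choices, not a required format.

```python
def sandwich_prompt(question, document):
    """Place the question both before and after a long document."""
    return (
        f"Question: {question}\n\n"
        f"{document}\n\n"
        f"Remember, the question is: {question}"
    )


prompt = sandwich_prompt(
    "What notice period does the contract require?",
    "[Very long document]",
)
print(prompt)
```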

A third habit, for paying customers: pre-trim aggressively. If you're passing an LLM old emails, meeting notes, database rows — trim or summarise before you pass them in. Every token you save is a token of saved latency, saved dollars, and saved risk of lost-in-the-middle. The smartest LLM integrations I've seen spend almost as much effort on "what to put in the context window" as they do on "what to ask the model." The former is the bottleneck more often than the latter.


What just changed in your head

You started this post with "token" and "context window" as jargon you'd seen in a sales email. You're ending it with a working picture of both: a token is a chunk of text the model treats as a single atom, and the context window is a fixed-length scroll of tokens the model can see at once, including everything in the prompt and the room it needs to write back.

That sentence explains half the strange behaviour you'll ever see from an LLM. When the model forgets, it's the scroll. When the bill doubles, it's the scroll. When the middle of a long document gets ignored, it's the scroll. When the model can't count letters in a word, it's the tokens. Most mysteries have prosaic explanations if you're willing to look at the scroll.

One sentence to carry forward:

Everything an LLM knows about your current situation fits on a finite scroll of tokens. Whatever is on the scroll, it can use. Whatever is not, it cannot.

In the next post, we switch from what the model reads to how the model got smart in the first place. Modern LLMs are trained in three distinct phases — pretraining, finetuning, and RLHF — and each phase does a very different job. Pretraining is when the model reads the internet. Finetuning is when it practices specific tasks. RLHF is when a human tutor teaches it to be helpful. Understanding these three acts is understanding where every character trait of your favorite chatbot came from.


Course navigation

Previous: M4.2 · The Transformer, Piece by Piece
You are here: M4.3
Next: M4.4 · Pretraining, Finetuning, RLHF

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
