The LLM Is a Function, Not a Friend

Here's the bug that changed how I write LLM code.

I shipped a feature that used Claude to summarise support tickets. It worked beautifully in dev. On day three in prod, a teammate pinged me: "Hey, I sent in the same ticket twice and got two completely different summaries. Is that a bug?"

It wasn't a bug. It was me forgetting something I'd known for months: an LLM API call is a function, not a conversation. And a function that's stochastic by default will — shockingly — give you different outputs for the same input.

This post is the first of a new course for developers: AI for Builders. If you finished AI Zero to Hero and have that smooth "I know what a transformer is" feeling, and you've opened your editor and immediately thought "now what do I actually type?", this is the course for you. Twenty-eight posts of me showing you what to type, why, and, most importantly, what breaks.

We start with the most load-bearing mental model of all, because if you skip it, every other decision in this course will feel like guesswork.

The model: `f(prompt, settings) -> text`

Here is the signature of every LLM API call you will ever make, written in the plainest Python imaginable:

def llm(prompt: str, settings: dict) -> str:
    ...

That's it. Every OpenAI call, every Anthropic call, every Gemini call, every call to a local model running in llama.cpp — they're all instances of this function. The arguments get richer (you'll see tools, system prompts, images, streams), and the return type gets richer (you'll see token counts, stop reasons, tool calls), but the shape doesn't change. You put bytes in, you get bytes out, the process is stateless, and that is your whole mental model.

Every time your intuition is about to drift — and it will drift, because chatbot UIs have trained all of us to think of LLMs as entities — come back to this signature. It is a function. It does not remember you between calls. It does not have a mood. It has inputs, settings, and an output.

Let's make that concrete. Here is what an actual call looks like.

# pip install anthropic
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
)

print(response.content[0].text)

Save that as demo.py, set ANTHROPIC_API_KEY in your shell, run it, and you will see a haiku. Run it again, and you will see a different haiku. That's not a bug. That's the function doing its job. The default settings include some randomness, and randomness is the point — if it didn't have any, the same prompt would always produce the same output, which sounds appealing until you realise you'd get the same summary, the same email draft, the same joke, every single time.
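One way to make the signature from the top of this section literal is to wrap the SDK call in a function with exactly that shape. The sketch below is a minimal version, not the SDK's own API: the name `llm`, the keys inside `settings`, and passing `client` as a third argument are assumptions made here, so that the function can be exercised with any object that exposes `messages.create`.

```python
from typing import Any


def llm(prompt: str, settings: dict, client: Any) -> str:
    """Hypothetical wrapper: one prompt and one settings dict in, text out.

    `client` is assumed to expose `messages.create(...)` the way the
    Anthropic client in demo.py does.
    """
    response = client.messages.create(
        model=settings["model"],
        max_tokens=settings["max_tokens"],
        messages=[{"role": "user", "content": prompt}],
    )
    # The response holds a list of content blocks; the first one carries the text here.
    return response.content[0].text
```

With the client from demo.py, `llm("Write a haiku about latency.", {"model": "claude-sonnet-4-6", "max_tokens": 200}, client)` would make the same request as the script above.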

Before we go further, let me draw the whole picture of what's happening in that call. Because as soon as you hold the shape in your head, the rest of this course becomes much easier.

Anatomy of a call

Picture the call as five boxes in a row. The two at the ends are your code. The three in the middle are, for the purposes of this course, a black box: you won't operate them, you will call them. Every capability, every failure mode, every cost line in your bill comes from something inside those three middle boxes, and your job as a builder is to reason about them from the outside without ever opening them.

The important thing to notice is that there is no arrow going back from any middle box to your code except the response. There's no side channel. No hidden memory. No "state" that survives between calls. Every time you call the API, the model processes your input as if it had never seen you before and gives you back a response. If you want the model to "remember" something from a previous call, you have to put it in the next prompt. That's it. That's how memory works.

This is worth repeating because beginners get it wrong constantly: the LLM does not remember the previous request unless you remind it. When you use ChatGPT and it remembers what you said earlier, that's the chat UI storing the conversation and re-sending all of it on every new turn. The model itself has no memory of you at all. It has a context window, and the context window is whatever your code put in it this call.

The three settings that change everything

Every LLM API has dozens of knobs. Most of them you will never touch. But three of them will shape the behaviour of your function more than anything else you do. Learn them first.

1 · the model

The model argument decides which function you are calling. claude-sonnet-4-6 is a different function than claude-haiku-4-5-20251001, with different strengths, latency, and cost. Switching models is the single biggest lever you have. In this course we will mostly use recent frontier models — Claude Sonnet, GPT-5-mini — and call out when a cheaper or faster model would be a better pick. The rule of thumb: start with the capable model, measure quality, then try to downshift to a cheaper one and see if quality holds. The reverse — starting with the cheap model and trying to crank up the prompt — wastes time in 80% of cases.
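The "measure quality, then downshift" loop can be as small as a list of test cases and a scoring function. What follows is a rough sketch under assumptions: `accuracy`, the test cases, and the two stand-in model functions are all hypothetical, and exact-match scoring only suits tasks with a single right answer.

```python
from typing import Callable

# Hypothetical labelled examples: (prompt, expected answer).
CASES = [
    ("Sentiment of: 'I love this'", "positive"),
    ("Sentiment of: 'This is awful'", "negative"),
    ("Sentiment of: 'Refund me now'", "negative"),
    ("Sentiment of: 'Works great'", "positive"),
]


def accuracy(model_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches the expected label."""
    hits = sum(
        1 for prompt, expected in cases
        if model_fn(prompt).strip().lower() == expected
    )
    return hits / len(cases)


# Stand-ins for real API calls, so the sketch runs offline.
def capable_model(prompt: str) -> str:
    return "negative" if ("awful" in prompt or "Refund" in prompt) else "positive"


def cheap_model(prompt: str) -> str:
    return "negative" if "awful" in prompt else "positive"


print(accuracy(capable_model, CASES))  # 1.0
print(accuracy(cheap_model, CASES))    # 0.75
```

In real use, each stand-in would wrap an API call with a different `model` argument, and the gap between the two scores is what tells you whether the cheaper model holds up.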

2 · `max_tokens`

This caps how much the function can output. It is not a quality knob. It is a "how long do I let this run" knob. Set it low (say 200) and the model will happily cut off mid-sentence. Set it absurdly high and you pay for tokens you didn't need. For most production calls, set it to just above the most the function should ever need to produce. A one-paragraph summary? max_tokens=300 is fine. A code refactor? Maybe max_tokens=2000. Getting this number wrong is the most common reason a call returns a truncated garbage output, and the failure mode is sneaky because the model doesn't know it got cut off.

3 · `temperature`

The one everyone has heard of, and the one most people misunderstand. Temperature controls how randomly the model samples its next token. temperature=0 means "pick the most likely token every time." Higher values mean "sometimes pick less likely tokens, for variety." The standard defaults are around 0.7 or 1.0, which is why running the same prompt twice gives you different haikus.
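To see what "how randomly the model samples" means, here is a toy version of temperature sampling. It is a sketch of the usual softmax-with-temperature formulation, not any provider's actual sampler; the three-token vocabulary and the scores are made up for illustration.

```python
import math
import random


def token_probs(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature == 0:
        # Treat zero as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = token_probs(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Made-up scores for the next token after "Name a color."
logits = {"blue": 2.0, "red": 1.0, "teal": 0.0}

for t in (0, 0.7, 1.0):
    rounded = {tok: round(p, 2) for tok, p in token_probs(logits, t).items()}
    print(t, rounded)
```

Running it shows the probability mass piling onto "blue" as the temperature drops, while the ranking of the tokens never changes. That is the whole effect: temperature reshapes the odds, it does not reorder them.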

Here is the bit nobody tells you: temperature=0 is not deterministic in production. The math says it should be, but in practice, floating-point nondeterminism on the provider's GPUs means you'll still get occasional variation even at zero. Close to deterministic, but not quite. Don't build reliability on it. If you need tighter repeatability, some providers expose a seed parameter to set on top of temperature=0, but not every API offers one, and where it exists it is usually best-effort. For most of your code, pick:

temperature=0 for extraction, classification, anything where there's a "right" answer
temperature=0.7 for drafting, brainstorming, anything where variety is good
Anywhere between for a mix

Now watch what happens when the same call runs three times at the default temperature.

# pip install anthropic
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def ask(prompt):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(ask("Name a color."))
print(ask("Name a color."))
print(ask("Name a color."))

Three runs, and you will often see three different colours (a prompt this short can also land on the same popular answer more than once). That is not a bug in the model and it is not a bug in the SDK. It is the function doing exactly what you asked it to do, which is to sample from a probability distribution over the vocabulary and return a name. If this surprises you, you are importing a mental model from classical software where the same input produces the same output, and LLMs are not classical software.

The two consequences you'll feel every day

Holding "LLM is a function" as the core mental model has two immediate, practical consequences for how you write code. They sound obvious when you read them. They will still trip you up for the first month.

1 · idempotency does not come for free

In a normal web service, calling an endpoint twice with the same input gives you the same response, and if you shove a result into a cache, the cache is trivially useful. With an LLM, neither of those is true by default. You called it twice; you got two different answers. Your cache will return the first answer for the next user who sends the same prompt, which might be the right behaviour, or it might be a subtle bug where two users get the same hallucinated citation because you were trying to save money.

The fix is not to abandon caching. The fix is to think about which calls you want to be idempotent and which you want to be fresh, and to cache the idempotent ones explicitly with the exact-input-equals-exact-output contract. We'll build a proper cache later in the course (M5 post 21). For now, know that idempotency is a design choice, not a default.
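As a preview of the idea rather than the cache the course builds later, an explicit cache can key on everything that defines the call. The sketch below assumes a hypothetical `call_model(prompt, settings)` function standing in for the API, and an in-memory dict standing in for a real store.

```python
import hashlib
import json
from typing import Callable

_cache: dict[str, str] = {}


def cache_key(prompt: str, settings: dict) -> str:
    """Hash the prompt and every setting, so any change produces a new key."""
    payload = json.dumps({"prompt": prompt, "settings": settings}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_llm(
    prompt: str,
    settings: dict,
    call_model: Callable[[str, dict], str],
    fresh: bool = False,
) -> str:
    """Return the stored answer for identical inputs unless the caller asks for a fresh one."""
    key = cache_key(prompt, settings)
    if not fresh and key in _cache:
        return _cache[key]
    text = call_model(prompt, settings)
    _cache[key] = text
    return text
```

The `fresh` flag is where the design choice lives: calls you want to be repeatable go through the cache, and calls where variety matters pass `fresh=True` and pay for a new sample.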

2 · conversation is a lie the chat UI is telling you

When you use client.messages.create with a list of messages, you're not having a conversation with the model. You are sending the entire conversation so far in a single call. The model has no memory of the previous turn. Your chat app is just keeping a list of messages in memory, appending each new one, and resending the whole thing every time.

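history = [
    {"role": "user", "content": "My name is Priya."},
    {"role": "assistant", "content": "Nice to meet you, Priya."},
    {"role": "user", "content": "What's my name?"},
]
# Reuses the `client` from the earlier examples; the whole list goes out in one call.
response = client.messages.create(model="claude-sonnet-4-6", max_tokens=100, messages=history)
# This works. The model sees the whole history and answers "Priya."
print(response.content[0].text)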

That final question works because the first turn is still in the list. Delete the first two messages, and the third message can't possibly be answered, because there is no hidden state anywhere. Every call is a fresh function invocation that looks at exactly the messages you gave it, and nothing else. Your app is the memory. Always.
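"Your app is the memory" can be sketched as a small class that owns the list and resends it every turn. This is an illustrative outline: the `Conversation` class and the `send` callable are hypothetical, with `send` standing in for whatever function wraps `client.messages.create` and returns the reply text.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the history in app memory and resends all of it on every turn."""

    def __init__(self, send: Callable[[list[Message]], str]) -> None:
        self.send = send
        self.history: list[Message] = []

    def say(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        # Pass a copy of the full history: the model only ever sees this list.
        reply = self.send(list(self.history))
        self.history.append({"role": "assistant", "content": reply})
        return reply


# Offline stand-in for the API: it reports how many messages it was shown.
def fake_send(messages: list[Message]) -> str:
    return f"I can see {len(messages)} message(s)."


chat = Conversation(fake_send)
print(chat.say("My name is Priya."))  # I can see 1 message(s).
print(chat.say("What's my name?"))    # I can see 3 message(s).
```

The stand-in makes the point visible: the second call sees three messages only because the class put them there.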

This mental model makes the whole rest of the course easier. "How does RAG work?" — it puts extra context into the next function call. "How does an agent work?" — it runs the function in a loop, feeding its output back as the next input. "How does fine-tuning work?" — it trains a new version of the function with different weights. All of these are variations on the same shape.

Admit what breaks

Every post in this course has a failure-modes section, because if I tell you an approach works without telling you how it fails, I am setting you up to learn the hard way. Here is what breaks on your first week of this mental model:

Forgetting to pass state. You will write code that "works" in a single-turn script and ship it, and then it will silently break when you try to make it multi-turn, because you never wired up the history. The fix is to always build with the history list even if your demo only uses one turn. Start as you mean to go on.
Relying on deterministic outputs that aren't. You'll set temperature=0, see the same output three times in a row, and call it done. Then on day five of production, it will occasionally drift and your downstream code will crash. Never build parsing code that assumes a specific exact wording. Instead, use structured outputs (coming up in B1.3) or a tolerant parser.
Blowing through max_tokens without noticing. Model cuts off mid-sentence, your json.loads(response.content[0].text) throws JSONDecodeError, you spend an hour on the wrong theory. Always check response.stop_reason. If it's "max_tokens", you got truncated. Treat that as a real error, not a soft warning.
Trusting the "context window" as though it were infinite. Even with a 200,000-token window, stuffing more into it rarely makes the output better past a certain point — and there's a real "lost in the middle" problem where the model under-weights content in the centre of a long context. Long context is a budget to spend wisely, not a fridge to fill.
Thinking temperature changes how "smart" the model is. It doesn't. It changes how much randomness is in the sampling. A lower temperature doesn't make it smarter on hard questions, it just makes it commit more confidently to its most likely answer — which could be wrong. Don't use temperature as a quality dial.

None of these are catastrophic. All of them are real. You will hit at least three of them in your first week building with LLMs, and every hit is a small, embarrassing learning experience. Save yourself one by reading this list.
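Two of the failures above, truncation and brittle parsing, can be guarded in a few lines. The helper below is a sketch under assumptions: `TruncatedResponse` and `parse_json_reply` are hypothetical names, the response is assumed to carry the `stop_reason` and `content[0].text` fields used earlier in this post, and the "tolerant" part is deliberately simple (it takes the span from the first `{` to the last `}`).

```python
import json
from types import SimpleNamespace


class TruncatedResponse(Exception):
    """Raised when the model stopped because it hit max_tokens."""


def parse_json_reply(response) -> dict:
    """Check for truncation first, then pull a JSON object out of the reply text."""
    if response.stop_reason == "max_tokens":
        raise TruncatedResponse("output hit max_tokens; raise the limit or shorten the task")
    text = response.content[0].text
    # Tolerate chatter around the JSON, e.g. "Sure! {...} Hope that helps."
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


# Stand-in response objects shaped like the SDK's, so the sketch runs offline.
ok = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(text='Here you go: {"priority": "high"} Anything else?')],
)
cut = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(text='{"priority": "hi')],
)

print(parse_json_reply(ok))  # {'priority': 'high'}
```

Calling `parse_json_reply(cut)` raises `TruncatedResponse` before `json.loads` ever runs, which points you at the real cause instead of a misleading `JSONDecodeError`.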

What just changed in your code

If the mental model landed, here is what you should do differently next time you open your editor:

Write your next LLM call as f(prompt, settings) -> text. Not as a conversation. Not as a "chat with an AI." A function call. Think in pure inputs and outputs.
Always design around statelessness. If your feature involves memory, the memory lives in your database, your session, or your app state — never "in the model."
Set max_tokens intentionally, check stop_reason on every response, and treat truncation as a real error.
Pick temperature per-call, not per-app. Extraction is 0. Drafting is 0.7. You'll have calls with different temperatures in the same feature.
If you find yourself writing response = llm.chat(...) and passing nothing else, stop and ask: what function am I actually calling right now? What's in the prompt? What settings? What's my plan for when this returns something weird?
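"Per-call, not per-app" can be made concrete by naming the settings for each kind of call. A minimal sketch follows; the preset names, the `build_request` helper, and the specific `max_tokens` values are assumptions for illustration, while the temperatures are the ones suggested above.

```python
# Hypothetical presets: one settings dict per kind of call, not one per app.
EXTRACTION = {"model": "claude-sonnet-4-6", "max_tokens": 300, "temperature": 0}
DRAFTING = {"model": "claude-sonnet-4-6", "max_tokens": 2000, "temperature": 0.7}


def build_request(prompt: str, settings: dict) -> dict:
    """Assemble the keyword arguments for one call: f(prompt, settings)."""
    return {**settings, "messages": [{"role": "user", "content": prompt}]}


extract_req = build_request("Pull the order ID out of this ticket: ...", EXTRACTION)
draft_req = build_request("Draft a friendly reply to this ticket: ...", DRAFTING)

print(extract_req["temperature"], draft_req["temperature"])  # 0 0.7
```

Each dict could then be passed along as `client.messages.create(**extract_req)`, which keeps every setting visible at the call site instead of buried in a global default.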

Next post, we'll get concrete about the four different ways you can actually call that function — curl, SDK, streaming, and async — and which one you want for which shape of problem. It is surprisingly rare that curl is the right answer in production, and you'll see why.


📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series — the builder's companion to AI Zero to Hero.
