Prompts Are Code, Not Config

There are two phases every team building with LLMs goes through.

Phase one: prompts live as string literals inside the file where they're used. A constant at the top of a module. A hardcoded f-string. A snippet pasted into a Slack thread six weeks ago that nobody has dared to touch since because "it seems to work."

Phase two: a senior engineer has had enough, opens a PR that moves all the prompts into a prompts.yaml or prompts.json file, and writes a small loader that pulls them out at runtime. The team calls this "treating prompts as config." Everyone feels better for a week.

And then they wake up to a regression nobody can explain. The YAML file changed three weeks ago. The test suite didn't catch it because there's no test on prompts. The git blame points at a PR that touched twelve prompt strings and has no eval evidence attached. The engineer who shipped it doesn't remember which one was the load-bearing change. Someone says "just roll back the file" and it turns out one of the other prompts in the file was genuinely needed. You have a Friday afternoon of misery.

The framing bug here is in the name. Prompts are not config. Prompts are code, and they deserve the same scaffolding that every other piece of code in your codebase gets: version control, review, tests, releases, and rollback. This post is about how to do that without making a career of it.

The "just a string" trap

Here is the pattern every team starts with, and it's fine for about two weeks:

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant that summarises support tickets.
Output a one-sentence summary under 120 characters."""

def summarise(ticket_text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        temperature=0,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": ticket_text}],
    )
    # The first content block holds the generated text.
    return response.content[0].text

What's wrong with this? Nothing, on the surface. The prompt is in the repo, it's reviewable in PRs, and git blame will tell you who last touched it. For a project with one prompt, this shape is perfect. Don't over-engineer it.

The pattern starts to hurt when you have fifteen prompts across eight modules, several of them share phrasing, three of them need to be updated when the product renames a feature, and your support rep is asking why yesterday's summaries have a different tone than last week's. At that point, four concrete things have gone wrong:

You can't find all the prompts. They're scattered across files, some inline, some in module-level constants, some constructed with f-strings at call time.
Duplication has crept in. Three modules each have their own version of "be polite, use plain English, keep it under 200 words" because they were written by different people at different times.
You can't answer "what changed?" The prompt PR that broke things touched seven files, and the diff is unreadable.
You have no way to test a prompt change before it reaches production, because your tests are written against code, not against prompts.

The fix isn't a YAML file. The fix is to treat prompts as a first-class software artifact. Here's the shape that works.

The working shape, in four moves

I'll walk through each move with the minimum code you need to make it real.

Move 1 · Prompts live in code, as templates

The first and most important move is this: your prompts belong in Python/TypeScript files, as functions that return strings, not in a separate data file. This is counter-intuitive, because "prompts as config" sounds cleaner. It isn't. Here's why.

When your prompt lives in code as a function:

Your type checker validates the variables you interpolate.
Your linter catches typos in variable names.
Your IDE gives you auto-complete.
Your test runner can import and test the function directly.
Your PR review shows prompt changes in-line with the code that uses them.
You can add logic — conditionals, loops over examples, different phrasings per locale — without inventing a template language.

When it lives in YAML:

You need a loader, which is another thing to maintain and cache.
Variable interpolation becomes another DSL ({name}, {{name}}, ${name} — whose syntax?).
Type safety is gone; you find typos at runtime.
You can't write conditional logic without embedding a template engine.
PR diffs are two files instead of one.
Tests need fixture loading.
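The typo point is easiest to see side by side. The sketch below is an illustration only: `YAML_TEMPLATE` stands in for text that a hypothetical YAML loader would hand back, and `render_from_config`, `user_prompt`, and `try_config_with_typo` are names I made up for the example. The same misspelled variable surfaces very differently in each style.

```python
# Stand-in for a template string returned by a YAML loader (hypothetical).
YAML_TEMPLATE = "Summarise the ticket for {product_name}:\n{ticket_text}"


def render_from_config(template: str, **variables: str) -> str:
    # Config-style rendering: variable names are only checked when format() runs.
    return template.format(**variables)


def user_prompt(ticket_text: str, product_name: str) -> str:
    # Code-style rendering: the signature names every variable the prompt needs.
    return f"Summarise the ticket for {product_name}:\n{ticket_text}"


def try_config_with_typo() -> str:
    # 'product' should be 'product_name'; nothing flags this until runtime.
    try:
        return render_from_config(
            YAML_TEMPLATE, ticket_text="Refund?", product="AcmeBilling"
        )
    except KeyError as exc:
        return f"KeyError: {exc}"


if __name__ == "__main__":
    print(try_config_with_typo())
    print(user_prompt("Refund?", "AcmeBilling"))
```

With the function version, the same mistake (`user_prompt(ticket_text=..., product=...)`) is an unexpected keyword argument, which a type checker can report before the code ever runs.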

The shape you want looks like this:

# prompts/summarise_ticket.py
from dataclasses import dataclass

@dataclass(frozen=True)
class SummariseTicketPrompt:
    version: str = "v3"

    def system(self) -> str:
        return (
            "You are an assistant that summarises support tickets. "
            "Write one sentence under 120 characters. Be factual, not "
            "sympathetic. Do not guess at fields that are not stated."
        )

    def user(self, ticket_text: str, product_name: str) -> str:
        return (
            f"Summarise the following ticket for the {product_name} product.\n"
            f"<ticket>\n{ticket_text}\n</ticket>\n"
            "Output only the one-sentence summary."
        )

That's it. A frozen dataclass with a version field, and two methods that return the system and user strings. The user method takes exactly the variables it interpolates, typed. No template language, no loader, no YAML. The call site looks like:

import anthropic

from prompts.summarise_ticket import SummariseTicketPrompt

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()
PROMPT = SummariseTicketPrompt()

def summarise(ticket_text: str, product_name: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        temperature=0,
        system=PROMPT.system(),
        messages=[{"role": "user", "content": PROMPT.user(ticket_text, product_name)}],
    )
    # The first content block holds the generated text.
    return response.content[0].text

Clean. Typed. Reviewable. Testable. No magic.
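The "add logic without inventing a template language" point from earlier might look something like the sketch below. It is a hypothetical variant of the summariser prompt, not part of the original class: the `examples` field, the `locale` parameter, the `<example>` tag layout, and the `v3-examples` version label are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummariseTicketPromptWithExamples:
    """Hypothetical variant of the summariser prompt, for illustration only."""

    version: str = "v3-examples"
    # (ticket, summary) pairs rendered as few-shot examples.
    examples: tuple[tuple[str, str], ...] = ()

    def system(self, locale: str = "en-GB") -> str:
        # A plain conditional replaces template-engine syntax.
        spelling = "British" if locale == "en-GB" else "American"
        return (
            "You are an assistant that summarises support tickets. "
            "Write one sentence under 120 characters. "
            f"Use {spelling} spelling."
        )

    def user(self, ticket_text: str, product_name: str) -> str:
        parts = [f"Summarise the following ticket for the {product_name} product."]
        # A plain loop renders zero or more few-shot examples.
        for example_ticket, example_summary in self.examples:
            parts.append(
                f"<example>\n<ticket>{example_ticket}</ticket>\n"
                f"<summary>{example_summary}</summary>\n</example>"
            )
        parts.append(f"<ticket>\n{ticket_text}\n</ticket>")
        parts.append("Output only the one-sentence summary.")
        return "\n".join(parts)
```

Because this is ordinary Python, a unit test can construct the class and assert on the rendered strings without calling a model at all.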

Move 2 · Version prompts like you version APIs

The version: str = "v3" on that dataclass is the second move. Prompts have the same property as public APIs: downstream consumers depend on their behaviour, and changes can be backwards-incompatible. When you edit a prompt significantly, bump the version. When you need to run old and new side by side, keep both classes alive temporarily.

@dataclass(frozen=True)
class SummariseTicketPromptV3:
    ...  # current prompt

@dataclass(frozen=True)
class SummariseTicketPromptV4:
    ...  # new prompt under test

Your call site can then gate on a feature flag:

PROMPT = SummariseTicketPromptV4() if feature_enabled("summarise_v4") else SummariseTicketPromptV3()

This lets you roll out prompt changes the same way you roll out code changes: behind a flag, to a small percentage of traffic, with an easy rollback. When v4 has proven itself in prod for a week, delete v3.
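If you don't already have a flag system, the "small percentage of traffic" part can be sketched with a stable hash. Everything here is an assumption for illustration: this `feature_enabled` takes a user ID and a percentage, which differs from the one-argument call shown above, and `rollout_bucket` and `pick_prompt_version` are hypothetical helpers.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def feature_enabled(flag_name: str, user_id: str, percent: int) -> bool:
    """True for roughly `percent` percent of users, and always the same users."""
    return rollout_bucket(flag_name, user_id) < percent


def pick_prompt_version(user_id: str, v4_percent: int) -> str:
    # Returns the version label; the call site would map it to a prompt class.
    if feature_enabled("summarise_v4", user_id, v4_percent):
        return "v4"
    return "v3"
```

Hashing the user ID rather than rolling a random number keeps each user on one version for the whole experiment, and setting the percentage to 0 is the rollback.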

Version numbers aren't decorative. They are what your logs, your eval records, and your analytics use to group outputs. Log the prompt version on every call. When someone reports a weird output next month, you'll want to know which version generated it.

Move 3 · Eval set runs on every PR

This is the move that prevents the Friday-afternoon misery.

An eval set is a small, curated collection of inputs with expected-behaviour checks. It runs on every PR that touches a prompt, just like a test suite runs on every PR that touches code. It doesn't have to be fancy. Fifteen to forty examples, each with a specific property your prompt is supposed to produce. When a PR changes the prompt, the eval set runs, and if it regresses by more than a threshold, the PR fails CI.

A minimal eval looks like this:

# evals/summarise_ticket_eval.py
from prompts.summarise_ticket import SummariseTicketPrompt
from llm_client import call_llm

EVAL_CASES = [
    {
        "input": "My invoice for last month has a $200 charge I don't recognise.",
        "product": "AcmeBilling",
        "must_contain": ["invoice", "charge"],
        "must_not_contain": ["I'm so sorry", "I completely understand"],
        "max_length": 120,
    },
    {
        "input": "Login page is returning a 500 error when I click 'forgot password'.",
        "product": "AcmePortal",
        "must_contain": ["login", "500"],
        "must_not_contain": ["sympathy", "understand how frustrating"],
        "max_length": 120,
    },
    # ... 20+ more cases
]

def test_summarise_prompt():
    prompt = SummariseTicketPrompt()
    failures = []
    for case in EVAL_CASES:
        output = call_llm(prompt.system(), prompt.user(case["input"], case["product"]))
        if len(output) > case["max_length"]:
            failures.append(f"too long: {output}")
        for token in case["must_contain"]:
            if token.lower() not in output.lower():
                failures.append(f"missing {token!r}: {output}")
        for token in case["must_not_contain"]:
            if token.lower() in output.lower():
                failures.append(f"contains banned {token!r}: {output}")
    assert not failures, failures

That's not a sophisticated eval. It's a substring-and-length check on twenty examples. It runs in sixty seconds. It catches roughly 80% of the "someone touched the prompt and broke something" regressions.
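The test above fails on any single miss. If you would rather gate on "regresses by more than a threshold", one possible shape is a pass rate compared against a stored baseline. This is a sketch under assumptions: the baseline value, the 0.05 allowed drop, and the names `case_passes`, `pass_rate`, and `gate` are mine, and `fake_summarise` is a stand-in for the real model call so the example runs offline.

```python
from typing import Callable

CASES = [
    {"input": "My invoice has a $200 charge I don't recognise.",
     "must_contain": ["invoice"], "max_length": 120},
    {"input": "Login page returns a 500 error.",
     "must_contain": ["login", "500"], "max_length": 120},
]


def case_passes(output: str, case: dict) -> bool:
    # Same style of check as the minimal eval: length plus required substrings.
    if len(output) > case["max_length"]:
        return False
    return all(token.lower() in output.lower() for token in case["must_contain"])


def pass_rate(summarise: Callable[[str], str], cases: list[dict]) -> float:
    passed = sum(case_passes(summarise(case["input"]), case) for case in cases)
    return passed / len(cases)


def gate(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """Return True when the PR may merge (pass rate has not dropped too far)."""
    return current >= baseline - max_drop


def fake_summarise(ticket_text: str) -> str:
    # Stand-in for the real model call; simply truncates the ticket.
    return ticket_text[:120]


if __name__ == "__main__":
    rate = pass_rate(fake_summarise, CASES)
    print(f"pass rate {rate:.2f}, merge allowed: {gate(rate, baseline=0.95)}")
```

A threshold is more forgiving of one flaky case than a hard assert, at the cost of needing a baseline number that someone has to keep up to date.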

Sophisticated evals — LLM-as-judge, human review, golden sets — come later and we cover them in B2.5. Start with the dumb version. Ship it on day one. You can upgrade it when you need to.

Move 4 · Log, sample, and audit in production

The fourth move is what happens after the prompt ships. For each LLM call in production, log at least these four things:

The prompt version (v3, v4)
The inputs you interpolated (with PII redacted)
The model output (with PII redacted)
The model name, temperature, stop_reason, and token usage

This is non-negotiable. When someone reports a weird output tomorrow, you need to be able to answer "which prompt version produced that, with what input, and what was the stop_reason?" If you can answer in three minutes, you're a serious team. If you can't answer at all, you're going to spend a day guessing.
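As a rough sketch of what one log record could contain, here is a structured line built from the four items above. The field names, the `LLMCallRecord` class, and the `build_log_line` helper are assumptions, and the `redact` function masks email addresses only; real PII redaction needs far more than one regex.

```python
import json
import re
from dataclasses import asdict, dataclass

# Deliberately crude: matches things that look like email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Very rough redaction that masks email addresses only."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


@dataclass(frozen=True)
class LLMCallRecord:
    prompt_version: str
    inputs: dict
    output: str
    model: str
    temperature: float
    stop_reason: str
    input_tokens: int
    output_tokens: int


def build_log_line(record: LLMCallRecord) -> str:
    payload = asdict(record)
    # Redact every interpolated input and the model output before serialising.
    payload["inputs"] = {key: redact(str(value)) for key, value in record.inputs.items()}
    payload["output"] = redact(record.output)
    return json.dumps(payload, sort_keys=True)
```

One JSON object per call keeps the "which version, what input, what stop_reason" question down to a single log query.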

With this logging in place, you can do continuous eval on production traffic. Sample 1% of calls each day, run their logged outputs through the same checks your eval set uses (or a stricter prod-oriented version), and alert if the pass rate drops. This is the early-warning system for prompt drift: either your users' inputs have shifted, or an underlying model update has changed behaviour.
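A minimal sketch of that sampling loop, assuming each logged record is a dict with an `output` key. The 0.9 alert threshold, the banned phrases, and the function names are placeholders I picked for illustration, not recommendations.

```python
import random


def sample_records(records: list[dict], rate: float, seed: int) -> list[dict]:
    """Keep each record with probability `rate`; the seed makes a run repeatable."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


def output_passes(output: str, max_length: int = 120) -> bool:
    # Production-side checks: a length limit and a short list of banned phrases.
    banned = ("i'm so sorry", "i completely understand")
    lowered = output.lower()
    return len(output) <= max_length and not any(phrase in lowered for phrase in banned)


def should_alert(records: list[dict], min_pass_rate: float = 0.9) -> bool:
    """True when the sampled pass rate falls below the threshold."""
    if not records:
        return False  # nothing sampled, nothing to alert on
    passed = sum(output_passes(record["output"]) for record in records)
    return passed / len(records) < min_pass_rate
```

Production inputs have no hand-written `must_contain` expectations, so the checks here are limited to properties that should hold for every output.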

Two anti-patterns I see constantly

Anti-pattern 1: "prompt config service." A team builds a runtime service that lets non-engineers edit prompts via a web UI, with versioning and approvals. This sounds great — PMs can tweak prompts without deploys. In practice: the tests drift out of sync with the service, no one has the full history in the git log, rollback requires using the UI (which may itself be broken), and you end up with prompt changes that have no code review. If your PM really needs to tweak prompts, the right move is to make the PR process easy enough that they can send one — not to build a bypass.

Anti-pattern 2: hand-written templates in YAML with custom {variable} interpolation. You end up with a half-baked template language and zero type safety. If you can write code, write code. YAML is for data that is genuinely data — feature flags, model names, per-environment settings — not for text with interpolation.

Admit what breaks

This workflow has overhead. The first time you set it up, it's an afternoon. If your codebase has one prompt and one call site, you don't need any of this — the magic string at the top of the file is fine. Don't apply this machinery to hobby projects.
Eval sets decay. You wrote a 20-case eval six months ago. Your product has since added three features. The eval still passes, but it no longer covers what the prompt actually does in prod. Schedule a quarterly review of eval coverage.
Version bumps proliferate. If every tiny wording change bumps a version, you end up with v47 and a call site that looks ugly. Bump versions only for behavior changes big enough that you'd want to A/B or flag-gate them. Small copy tweaks can just be edits to the current version — with the eval set as the safety net.
Logging inputs and outputs has a privacy cost. Check with your legal and security people about what you can log, for how long, and with what redaction. Don't ship a system that logs PII to your analytics warehouse.
Your first eval will be wrong. You'll write a check that says "output must contain the word invoice" and then a user hits an edge case where the right answer genuinely doesn't contain it. That's fine; update the eval. Evals are living code.

What just changed in your code

Move prompts out of string literals into typed template classes (or functions) in a prompts/ directory.
Add a version field to every prompt. Log it on every call.
Write a 20-case eval set for your most important prompt this week, and wire it to run on PR.
Log prompt version, input, output, and usage on every production call. If you can't answer "what made that weird output" in three minutes, you're under-instrumented.
Do NOT put your prompts in a YAML file with a custom template language. Put them in code. Type them. Review them.

Next post, B2.3, we take on the two most-hyped "prompting techniques" of the last three years — few-shot examples and chain-of-thought — and figure out when they earn their tokens and when they're expensive cargo-culting.


📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.
