The Evals-First Loop

I can tell, within about two minutes of joining a new team, whether they're serious about building with LLMs. I don't have to look at the code. I don't have to see the prompts. I ask one question: "Show me your eval."

A team that's serious says "here's the script, here are the thirty cases, here's the dashboard, here's the threshold that blocks a PR from merging." A team that isn't serious says some combination of: "we test by vibes," "our PM tries it a few times," "we don't really have one," or — most commonly — "we were going to do that once we shipped." They haven't shipped. They've had a feature in beta for four months because every new prompt change regresses something subtly and they can't tell what.

This post closes Module B2. We've built up the prompt hierarchy, put prompts in code, learned when few-shot and CoT pay, and walked through the security nightmare that is prompt injection. This is the last missing piece: the loop where you write the eval before you write the prompt, not after. It is the single biggest quality lever you have. It is also the thing teams keep deferring because "we'll get to it."

We're going to get to it right now.

What an eval actually is

An eval — short for "evaluation" — is a small, curated set of input-output expectations for your LLM feature, plus a way to run them and grade the result. That's it. It is not a benchmark, it is not a framework, it is not MLflow or LangSmith or a vendor-sold product. Those things are helpful. They are not necessary to start.

The minimum viable eval has three things:

A list of example inputs — realistic for your product, 20 to 50 cases to start.
A way to check each output against an expected property — exact match, contains-string, length check, regex, a second LLM grading it, or a human clicking yes/no.
A runner that loops through cases, calls your prompt, collects results, and prints a pass/fail summary.

If you have those three, you have an eval. If you don't, you have nothing. Teams that talk about "comprehensive eval strategies" and don't have those three things are stalling.

Here's a minimum viable eval in Python, for a support-ticket classifier:

# evals/classifier_eval.py
# pip install anthropic
import os
from dataclasses import dataclass
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@dataclass
class Case:
    input: str
    expected: str

CASES = [
    Case("My card was charged twice this month.", "billing"),
    Case("The app crashes on login.", "technical"),
    Case("I want to close my account.", "account"),
    Case("Where do I see my invoice history?", "billing"),
    Case("I can't upload a profile picture.", "technical"),
    # ... 15 more cases
]

def classify(text: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=20,
        temperature=0,
        system="Classify into exactly one of: billing, technical, account. Reply with one word.",
        messages=[{"role": "user", "content": text}],
    )
    return resp.content[0].text.strip().lower()

def run():
    pass_count = 0
    for case in CASES:
        got = classify(case.input)
        ok = got == case.expected
        pass_count += ok
        status = "✓" if ok else "✗"
        print(f"{status} {case.input!r:60} expected={case.expected} got={got}")
    print(f"\n{pass_count}/{len(CASES)} passed ({pass_count/len(CASES):.0%})")

if __name__ == "__main__":
    run()

That's forty lines. You can write it in twenty minutes. Run it and it gives you a number — "18/20 passed, 90%." That number is now the thing you defend. Every change to the prompt either keeps the number or moves it.

This is the most important thing I'm going to say in this post: the number is the product. Not the feature, not the prompt, not the demo you showed your boss. The eval number is what makes your feature a thing you can reason about instead of a vibe you have to trust.

The habit is this: before you write or change a prompt, write or extend the eval first.

This looks obvious and feels obvious and almost nobody actually does it. Here's why it's worth enforcing as a rule:

It forces you to operationalise "what good looks like" before you start fiddling. Writing an eval case makes you decide, concretely, what "a good answer for this input" means. Most prompt iteration wastes time because the engineer doesn't have a clear mental target — they're just tweaking until the output feels better.
It gives you a baseline. You know the old prompt scored 18/20. Your new prompt had better score 18/20 or higher, or you're regressing. Without a baseline, you can't tell whether you made the feature better or worse, and most prompt engineers live in that fog.
It catches the "fixed X, broke Y" bug. You add an example to your prompt to fix case #7. You rerun the eval. Case #12 now fails. Without the eval, you would have shipped with #12 broken and discovered it in production.
It makes PR review possible. The PR says "eval went from 18/20 to 19/20." The reviewer doesn't need to eyeball the prompt diff. The reviewer needs to sanity-check the eval and trust the number.
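One way to make the "fixed X, broke Y" check mechanical is to keep the per-case results from the last accepted run and diff them against the new run. This is a minimal sketch, assuming results are stored as a dict keyed by a case id; the ids, the dict layout, and the `diff_results` name are my own illustration, not something the runner above produces.

```python
# Sketch only: `baseline` and `current` map a (hypothetical) case id to pass/fail.

def diff_results(baseline: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs case by case instead of by total score."""
    shared = baseline.keys() & current.keys()
    return {
        # Passed before, fails now: the "fixed X, broke Y" cases.
        "regressed": sorted(c for c in shared if baseline[c] and not current[c]),
        # Failed before, passes now.
        "fixed": sorted(c for c in shared if not baseline[c] and current[c]),
        # Cases added since the baseline was recorded.
        "new": sorted(current.keys() - baseline.keys()),
    }


baseline = {"case-07": False, "case-12": True, "case-13": True}
current = {"case-07": True, "case-12": False, "case-13": True}
report = diff_results(baseline, current)
print(report)
```

Both runs in this example score 2/3, so the headline number alone would hide the swap. The per-case diff is what surfaces it.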

The four grading styles, in order of cost

Your eval runner has to decide whether an output is correct. There are four main ways to do that, each with a cost-and-accuracy trade-off.

1 · Exact match (cheapest, least flexible)

Output string equals expected string. Perfect for classification, routing, schema-bound fields. Fails for anything with creative variance. Use this whenever you can.

def check(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

2 · Rule-based checks

Length bounds, contains-string, regex match, JSON-shape validation, forbidden-phrase check. Still cheap, still fast, more flexible than exact match. Good for 80% of real cases.

def check(output: str) -> list[str]:
    fails = []
    if len(output) > 120:
        fails.append(f"too long: {len(output)}")
    if "invoice" not in output.lower():
        fails.append("missing 'invoice'")
    if "I completely understand" in output:
        fails.append("sycophantic phrasing")
    return fails

Rule-based checks are your first upgrade from exact match. If you start here and grow into LLM-as-judge later, you'll find that most of what you actually want to enforce is expressible as rules.
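JSON-shape validation is on the list above but not in the example, so here is a rough sketch of one. It assumes the prompt is asked to return an object with a `label` and a `confidence` field; that schema and the `check_json_shape` name are hypothetical, so swap in whatever your feature actually emits.

```python
import json

# Assumed schema for illustration: {"label": <one of ALLOWED_LABELS>, "confidence": <0..1>}
ALLOWED_LABELS = {"billing", "technical", "account"}


def check_json_shape(output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    fails = []
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        fails.append(f"unexpected label: {label!r}")
    conf = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        fails.append("confidence missing or outside [0, 1]")
    return fails


print(check_json_shape('{"label": "billing", "confidence": 0.9}'))
print(check_json_shape('{"label": "refund", "confidence": 2}'))
```

Like the check above, it returns reasons rather than a bare boolean, so a failing case tells you what to look at.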

3 · LLM-as-judge

A second LLM call, given the input, the expected answer, and the actual output, asked to judge whether the output is correct. More expensive (one extra call per eval case), more flexible (can evaluate subjective properties like "polite tone," "faithful summary," "responsive to the question"), and itself an LLM call that can be wrong.

JUDGE_PROMPT = """You are a strict grader. Given an input, an expected answer,
and an actual answer, respond with only one word: PASS or FAIL.

Input: {input}
Expected answer properties: {expected}
Actual answer: {actual}

Is the actual answer acceptable? Reply PASS or FAIL only."""

def judge(input: str, expected: str, actual: str) -> bool:
    # Reuses the `client` created in evals/classifier_eval.py above.
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=5,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            input=input, expected=expected, actual=actual,
        )}],
    )
    # Anything that does not start with PASS counts as a failure.
    return resp.content[0].text.strip().upper().startswith("PASS")

Use LLM-as-judge when your quality signal is genuinely hard to express in rules — think "is this summary faithful to the source" or "is this recommendation appropriate to the user's context." Understand that the judge is a probabilistic grader and itself needs to be calibrated against human judgement before you trust it.
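Calibration can start very simply: have a human label a batch of outputs, run the judge on the same batch, and count how often they agree. The sketch below assumes you already have both verdicts as parallel lists of booleans; `judge_agreement` and the sample data are illustrative, and raw agreement is only one of several measures you could use.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Summarise how the judge's verdicts line up with human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(human, judge))
    n = len(pairs)
    return {
        "agreement": sum(h == j for h, j in pairs) / n,
        # Judge said PASS where the human said FAIL: bad outputs slipping through.
        "false_pass_rate": sum(j and not h for h, j in pairs) / n,
        # Judge said FAIL where the human said PASS: good outputs being blocked.
        "false_fail_rate": sum(h and not j for h, j in pairs) / n,
    }


# Made-up verdicts for five outputs, in the same order in both lists.
human_labels = [True, True, False, True, False]
judge_labels = [True, True, True, True, False]
stats = judge_agreement(human_labels, judge_labels)
print(stats)
```

Splitting the disagreements by direction matters because the two errors cost different things: a false pass ships a bad answer, a false fail wastes your time.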

4 · Human review

The gold standard, the slowest, the most expensive, the most accurate. A human reads the input and the output and clicks yes/no. Use sparingly: only for the most important evals, only when the other three levels don't capture the quality signal, and only with an interface that makes it fast. Teams that set up a simple internal web UI to review 50 outputs a day are building quality that compounds. Teams that pass spreadsheets of outputs around by email are not.

Rule of thumb: start with exact match and rules for 80% of your eval cases. Add LLM-as-judge for the 15% that can't be rule-graded. Use human review for the 5% that matters most and feeds back into the judge's calibration.

The four ways your eval goes wrong

Evals are not magic. They have their own failure modes. The four I see most:

1 · The eval is too small to be stable

You have 8 cases. Your eval score jumps from 6/8 to 8/8 between runs because two of the cases are near the decision boundary and the model is nondeterministic. You can't tell if your prompt change helped. Fix: more cases. Twenty is a reasonable floor. Fifty is better. Sample from real production traffic as the product grows.
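Before adding cases, it can help to find out which existing ones are unstable. A rough sketch: run every case several times and flag the ones whose outcome flips. The `find_flaky` helper and the fake classifier are assumptions for illustration; in practice you would pass in your real `classify` function and pay for the extra calls.

```python
import itertools
from collections import Counter
from typing import Callable


def find_flaky(
    cases: list[tuple[str, str]],
    classify: Callable[[str], str],
    runs: int = 5,
) -> dict[str, Counter]:
    """Run each (input, expected) case `runs` times; return cases with mixed outcomes."""
    flaky = {}
    for text, expected in cases:
        outcomes = Counter(classify(text) == expected for _ in range(runs))
        if len(outcomes) > 1:  # saw both a pass and a fail
            flaky[text] = outcomes
    return flaky


# Stand-in for a nondeterministic model: alternates its answer on "refund" inputs.
_answers = itertools.cycle(["billing", "account"])


def fake_classify(text: str) -> str:
    return next(_answers) if "refund" in text else "technical"


cases = [
    ("I want a refund and to close my account.", "billing"),
    ("The app crashes on login.", "technical"),
]
flaky = find_flaky(cases, fake_classify, runs=4)
for text, outcomes in flaky.items():
    print(f"flaky: {text!r} passed {outcomes[True]} of {sum(outcomes.values())} runs")
```

A flaky case is not necessarily a bad case. It often marks a genuinely ambiguous input, which is worth knowing either way.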

2 · The eval over-fits to the prompt

You keep adding cases that your current prompt fails on. Eventually the eval is a perfect fit for the prompt you have, not a representative sample of real traffic. Your eval score is 100% and your users are complaining. Fix: source cases from real user traffic (anonymised), not from your own intuition about hard cases. Keep the eval aligned with what users actually send.
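Pulling cases from traffic can be as plain as a seeded random sample with a redaction pass. The sketch below assumes you can export user messages as a list of strings. The two regexes are a crude placeholder and not real anonymisation, so treat `redact` as a stand-in for whatever your privacy process requires.

```python
import random
import re

# Placeholder patterns, assumed for illustration; real anonymisation needs more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\d{6,}")  # order numbers, card fragments, phone numbers


def redact(text: str) -> str:
    """Mask email addresses and long digit runs."""
    text = EMAIL.sub("<email>", text)
    return LONG_DIGITS.sub("<number>", text)


def sample_cases(messages: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick up to k distinct messages at random, then redact them."""
    rng = random.Random(seed)         # seeded so the sample is reproducible
    unique = sorted(set(messages))    # dedupe; sorting keeps the seed meaningful
    picked = rng.sample(unique, min(k, len(unique)))
    return [redact(m) for m in picked]


traffic = [
    "Mail me at jo@example.com about order 12345678",
    "The app crashes on login.",
    "The app crashes on login.",
    "Where do I see my invoice history?",
]
for message in sample_cases(traffic, k=2):
    print(message)
```

The sampled inputs still need expected outputs, and writing those by hand is the part that keeps the eval honest.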

3 · The check is wrong

Your rule-based check says "output must contain 'invoice'," and a user asks a billing question that's best answered without using the word. The correct answer fails the eval. You change the prompt to always say "invoice," and now it's worse for most users. Fix: when a "correct" answer fails your eval, first update the eval, then update the prompt. Evals are living code.

4 · Nobody looks at the number

You have an eval. It runs in CI. It hasn't failed a PR in three months. The number has been 18/20 for four releases. Nobody has looked at which cases are the failing two. Fix: review failing cases every sprint. They are your best source of prompt-improvement ideas.

Where to put evals in your workflow

Three load-bearing places, all cheap to add:

On every PR that touches a prompt. Fails CI if the score drops by more than a threshold (e.g. 2 cases). This is the single biggest win. Set it up on day one.
As a nightly job against a slightly larger set. This gives you a daily signal of quality drift over time, and it is good for catching upstream model updates that change behavior.
Sampled on real production traffic. Run 1% of prod calls through an eval (a production-friendly version that doesn't have ground-truth labels, so you use LLM-as-judge or rule-based checks only). Gives you continuous quality monitoring and a real measure of drift under the actual input distribution.

The hierarchy, top to bottom, is: fast tight loops in PR CI, slower broader runs nightly, continuous sampling from prod. Each layer catches a different class of regression.
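The PR gate is the smallest of the three to build. Here is a sketch of the decision itself, assuming the baseline pass count is stored somewhere your CI job can read it; the `gate` function and the example numbers are hypothetical.

```python
def gate(baseline_passed: int, current_passed: int, max_drop: int = 2) -> int:
    """Return a process exit code: 0 lets the PR through, 1 blocks it."""
    drop = baseline_passed - current_passed
    if drop > max_drop:
        print(f"FAIL: eval dropped by {drop} cases (allowed: {max_drop})")
        return 1
    print(f"OK: baseline {baseline_passed}, current {current_passed}")
    return 0


# In a CI script you would end with something like:
#     sys.exit(gate(baseline_passed, current_passed))
exit_code = gate(baseline_passed=18, current_passed=15)
```

A non-zero exit code is all most CI systems need to mark the check as failed. The threshold is a judgement call: too tight and noise blocks PRs, too loose and slow regressions slide through.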

Admit what breaks

Evals take effort up front. The first eval is an afternoon. The maintenance is a few hours a month. Teams skip it because "we'll do it later" and then live with fuzzy quality forever. Pay the price on day one.
A high eval score is not a high-quality product. The eval measures what you put in the eval. If your cases don't represent real traffic, your number is a comfortable lie. Keep cases fresh.
LLM judges can be biased. The judge model has its own quirks: it may prefer verbose answers, answers in a particular style, or answers that agree with the reference. Spot-check the judge against human review periodically.
Running evals costs money. Twenty cases, three grading calls each, at scale, add up. Budget for it, especially in CI. Most teams are fine; some hit real money at large eval sets.
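The arithmetic is worth doing once for your own setup. This sketch counts one feature call plus the grading calls for each case, then multiplies out to a month. The runs-per-day figure and the per-call price are placeholders I made up, not real pricing, so plug in your own numbers.

```python
def calls_per_run(cases: int, grading_calls_per_case: int) -> int:
    """One call to the feature under test, plus its grading calls, per case."""
    return cases * (1 + grading_calls_per_case)


def monthly_cost(calls: int, runs_per_day: int, cost_per_call: float, days: int = 30) -> float:
    """Total spend for a month of eval runs at a flat per-call price."""
    return calls * runs_per_day * days * cost_per_call


calls = calls_per_run(cases=20, grading_calls_per_case=3)
print(f"{calls} calls per eval run")

# Placeholder inputs: 25 CI runs a day at 0.01 currency units per call.
estimate = monthly_cost(calls, runs_per_day=25, cost_per_call=0.01)
print(f"estimated monthly spend: {estimate:.2f}")
```

Twenty cases with three grading calls each comes to 80 calls per run. The per-run number looks small; the multiplier that matters is how often CI runs it.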
Evals drift when your product changes. A feature pivots and your old eval is no longer measuring what you ship. Review every quarter.
Evals are not a substitute for user feedback. They're a proxy. They're the best proxy you have in development. But the actual user report of "this answer was wrong" is still the ground truth, and your evals should evolve to catch next time what the user caught this time.

What just changed in your code

Write a 20-case eval for your most important LLM feature tonight. Use exact match and rule-based checks. You don't need LangSmith. You need a Python file and twenty tuples.
Run the eval before and after every prompt change. No "I tried it a few times and it seemed better." Always a number.
Wire the eval to CI so a PR that drops the score by more than your threshold fails automatically.
Source cases from real user traffic (anonymised), not from your own head. Your head has blind spots.
Review failing cases every sprint. They're your best prompt-improvement backlog.
When you upgrade the model, rerun the eval. Model upgrades are the most common source of surprise regressions.

And that closes Module B2 — Prompts as Code. You now have the hierarchy of prompt roles, prompts as versioned code artifacts, honest accounting of few-shot and chain-of-thought, a threat model for prompt injection, and the evals-first loop to tie it all together. You can make calls that behave and you can change those calls with confidence.

Next up is Module B3 — Retrieval, Really. Embeddings in thirty minutes of code, the boring question of whether you actually need a vector store, chunking strategies that make or break RAG, hybrid search, and the failure modes of retrieval. See you there.


📚 AI for Builders · Course Home — 28 posts, six modules.

This post is part of the AI for Builders series.