Evaluation — How Do You Know an LLM Is Any Good?

The hardest problem in modern AI isn't building the model. It's figuring out whether the model you just built is actually better than the last one. Here's why that's so brutally hard — and what the working state of the art really looks like.

Here is a question that sounds simple and is not.

You've built a thing that uses an LLM. It could be a chatbot, a summarizer, an email drafter, a code assistant, anything. Before you ship it, you want to know if it's good. A more specific version: you swapped your underlying LLM for a newer one. Is the new one better for your use case?

Think about how you'd answer. Your first instinct is probably "run it on some examples and see." Okay, on how many? Picked how? Evaluated by whom? Against what criteria? Using what scoring system? And when the numbers come back — say, the new model gets 83% on your test and the old one got 79% — is that four-point gap real? Is it meaningful? Does it mean the new model is actually better, or does it just mean it's better on those particular questions?
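
To make the "is that four-point gap real?" question concrete, here is a rough sketch in Python. It assumes a hypothetical test-set size (the numbers above don't say how many questions were asked) and uses the textbook normal approximation for the difference between two proportions, treating the two runs as independent samples. If both models answer the same questions, a paired method such as a paired bootstrap would be the better tool; this is only meant to show why the size of the test set matters.

```python
import math

def gap_confidence_interval(acc_new, acc_old, n, z=1.96):
    """Approximate 95% confidence interval for acc_new - acc_old.

    Assumes each model was scored on n independent questions
    (normal approximation to the difference of two proportions).
    """
    gap = acc_new - acc_old
    std_err = math.sqrt(acc_new * (1 - acc_new) / n + acc_old * (1 - acc_old) / n)
    return gap - z * std_err, gap + z * std_err

# n is a hypothetical test-set size, not a figure from the post.
small_low, small_high = gap_confidence_interval(0.83, 0.79, n=100)
large_low, large_high = gap_confidence_interval(0.83, 0.79, n=2000)

print(f"n=100:  gap is between {small_low:+.3f} and {small_high:+.3f}")
print(f"n=2000: gap is between {large_low:+.3f} and {large_high:+.3f}")
```

Under these assumptions, with 100 questions the interval straddles zero (roughly -0.07 to +0.15), so the four points could easily be noise. With 2,000 questions it stays above zero. Same headline numbers, very different conclusions.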

Every one of those follow-up questions is a rabbit hole, and every one of them is where real LLM evaluation actually lives. Most of the people you've seen confidently declaring "Claude is better than GPT" or "this open model beat the benchmark" are skipping at least half of them. In this post we go through each one honestly. No calculus. One uncomfortable truth: nobody has really solved this, and the working state of the art is a mix of crude tools applied with care.


Why this is so hard, in one paragraph

Traditional machine learning had it easy. You had a clear correct answer for every input, you ran your model on a held-out test set, you counted how often it got the answer right, and you reported accuracy. Done. This worked because traditional ML tasks — image classification, spam detection, fraud prediction — had binary or well-defined outputs.

LLM outputs are usually not like that. They're text. Often long text. Often there are many valid answers. Often "good" depends on tone, structure, accuracy, and judgement, all at once. Often the user wanted something slightly different than what they literally asked for. Often the "right answer" is different for different users. Traditional accuracy metrics collapse in the face of all of this. You can't grade a three-paragraph summary the way you grade a spam classifier.

So the field has had to build new methods for evaluation, from scratch, while the models being evaluated were changing under their feet. The result is a toolbox where every tool is flawed and you have to combine several of them to get a trustworthy picture. Let's walk through them.


Method 1 · public benchmarks

The first line of defence is public benchmarks. These are big, standardized test sets covering specific capabilities: math word problems, reading comprehension, coding, common-sense reasoning, multilingual question answering, and so on. You've probably seen the names: MMLU, HumanEval, GSM8K, BIG-bench, SWE-bench, MATH, HellaSwag. Every new frontier model announces its scores on these, usually compared to the previous state of the art.

Benchmarks are valuable for one reason: they're public and reproducible. Anyone can run them, everyone gets roughly the same numbers, and the leaderboard provides a shared frame of reference.

They're also seriously flawed in at least three ways.

1 · benchmarks leak. Because every major benchmark is public, its questions inevitably end up in the pretraining data of the next generation of models. A model that "scores 95% on MATH" might genuinely be good at math, or it might have seen the test questions during training. There is a whole line of research trying to detect benchmark contamination, and the consistent finding is that it's widespread.

2 · benchmarks don't match your task. Your actual product is not "questions like MMLU." It's "draft a response to a specific kind of customer complaint" or "extract fields from this specific type of document." A model that does great on MMLU might do badly on your task. A model that does badly on MMLU might do great. The correlation between benchmark scores and performance on your task is often weak, and the only way to find out how weak is to measure on your own data.

3 · benchmarks optimize for what's measurable, not what matters. Benchmarks reward the things that are easy to grade automatically. That means multiple-choice questions, short-answer extraction, code-compile-or-not. They under-measure things like reasoning quality, tone, explaining-yourself-well, knowing-when-to-refuse, and handling ambiguity. Over time, models get trained with benchmark scores in mind, and the things benchmarks don't measure get less attention. This is Goodhart's law in action, and it's why some benchmark-winning models feel worse to use than lower-scoring alternatives.

The honest role of public benchmarks: a rough sanity check and a comparison axis, not a real answer to "is this good for my task". Treat a benchmark win as a weak positive signal, not proof.


Method 2 · a private golden set

When benchmarks aren't enough — and for your specific product, they never are — the next lever is to build your own private benchmark. A golden set of examples from your real use case, with correct answers attached, that you run every model candidate against.

This is what serious teams actually do. The steps are:

  1. Collect 50-500 real examples of the kind of input your product will get.
  2. For each, have a human write (or choose) the ideal output.
  3. Store those examples somewhere private, so they never get shared externally and never leak into anyone's training data.
  4. For every model change, prompt change, retrieval change, or other system change, run the golden set and compare outputs to the expected ones.

This is unglamorous and it works. A private golden set has the one property public benchmarks don't: it actually measures your task. You can tell whether a change helped or hurt on the things you care about, not on the things some academic decided to grade.

The catch is in step 5, which I haven't written yet: how do you score the outputs? For some tasks, the answer is obvious (the output either matches a regex, or compiles, or doesn't). For most interesting LLM tasks, the answer is "humans have to read and judge," and that's where the real costs show up.
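
For the easy case, where a regex can decide pass or fail, the whole loop fits in a few lines. This is a minimal sketch under loud assumptions: the golden set, the invoice task, and the two "models" are all made up for illustration (the stand-ins are plain Python functions, where in a real setup they would be calls to your LLM system).

```python
import re
from typing import Callable

# Hypothetical golden set: a real-looking input plus a pattern the output must contain.
GOLDEN_SET = [
    {"input": "Invoice total is $42.50, due 2024-03-01.", "expected_pattern": r"42\.50"},
    {"input": "Invoice total is $7.00, due 2024-04-15.", "expected_pattern": r"7\.00"},
    {"input": "Invoice total is $130.99, due 2024-05-20.", "expected_pattern": r"130\.99"},
]

def run_golden_set(model: Callable[[str], str], golden_set: list[dict]) -> float:
    """Return the fraction of golden examples whose output matches the expected pattern."""
    passed = 0
    for item in golden_set:
        output = model(item["input"])
        if re.search(item["expected_pattern"], output):
            passed += 1
    return passed / len(golden_set)

# Stand-in for candidate A: extracts any dollar amount.
def candidate_a(text: str) -> str:
    match = re.search(r"\$(\d+\.\d{2})", text)
    return f"total={match.group(1)}" if match else "total=unknown"

# Stand-in for candidate B: deliberately limited, only handles amounts under $100.
def candidate_b(text: str) -> str:
    match = re.search(r"\$(\d{1,2}\.\d{2})", text)
    return f"total={match.group(1)}" if match else "total=unknown"

score_a = run_golden_set(candidate_a, GOLDEN_SET)
score_b = run_golden_set(candidate_b, GOLDEN_SET)
print(f"candidate A: {score_a:.0%}, candidate B: {score_b:.0%}")
```

The interesting part is not the scoring function, it's that the same fixed inputs get run against every candidate, so a difference in score points at the change you made.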


Method 3 · human evaluation

For tasks where quality is subjective, the gold standard is human evaluation. A human reads the model's output and grades it — usually on several dimensions (accuracy, helpfulness, tone, format compliance, whatever you care about). You do this for every output on your golden set, for every model candidate, and you aggregate the scores.

Human eval has obvious strengths: humans can judge nuance, tone, and accuracy all at once, and they can notice things no automated metric would catch. It also has obvious weaknesses: it's slow, expensive, inconsistent between raters, and hard to scale. If your golden set is 300 examples and you have 4 model candidates, that's 1,200 outputs a human needs to read and grade. At 2 minutes per grade, that's 40 hours of work. Per iteration.

A few tricks make it more tractable:

Pairwise comparison beats absolute scoring. Humans are much better at "which of these two is better" than at "rate this on a scale of 1-10." Pairwise comparisons produce cleaner, more consistent data, and you can aggregate them into rankings. This is the approach behind LMSYS Chatbot Arena, which ranks models by having humans pick their preferred response from blind pairs.
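
One common way to turn "A beat B" votes into a ranking is Elo-style rating updates, borrowed from chess. The sketch below is an illustration only: the votes are invented, the constants (starting rating of 1000, K of 32) are conventional defaults I'm assuming, and it is not a description of how any particular leaderboard computes its scores.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32.0, start=1000.0):
    """Compute Elo-style ratings from a sequence of (winner, loser) pairs."""
    ratings = defaultdict(lambda: start)
    for winner, loser in comparisons:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Surprising wins move ratings more than expected ones.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical human votes: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 3
    + [("model_b", "model_c")] * 2
    + [("model_b", "model_a")]
)

ratings = elo_ratings(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Note that sequential Elo depends on the order the votes arrive in. With a small number of votes, shuffling them and averaging over several runs gives a steadier picture.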

Focus on the spots where the system struggles. If 80% of your golden set is clearly fine no matter which model you use, don't waste human time on it. Have the human only grade the cases where models disagree, or where any model's output looks bad. You'll get most of the signal for a fraction of the cost.
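
Here is a small sketch of that triage step. It assumes short, structured outputs where "the models agree" can be approximated by comparing normalized text, and it uses an empty output as a stand-in for "looks bad". Both are simplifications I'm making for illustration; for long free-text answers you'd need a fuzzier notion of agreement.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def needs_human_review(outputs_by_model: dict[str, str]) -> bool:
    """Flag an example if any output is empty or the models disagree."""
    answers = [normalize(text) for text in outputs_by_model.values()]
    any_empty = any(answer == "" for answer in answers)
    models_disagree = len(set(answers)) > 1
    return any_empty or models_disagree

# Hypothetical outputs from two candidates on three golden-set examples.
results = {
    "ex-001": {"model_a": "Refund approved.", "model_b": "refund  approved."},
    "ex-002": {"model_a": "Refund approved.", "model_b": "Refund denied."},
    "ex-003": {"model_a": "", "model_b": ""},
}

review_queue = [ex_id for ex_id, outputs in results.items() if needs_human_review(outputs)]
print(review_queue)
```

Only the second and third examples land in the human's queue; the first differs in casing and spacing only, so nobody has to read it.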

Train your raters carefully. Consistency between raters is hard. Give them a clear rubric, show them examples of what "good" and "bad" look like, have two raters look at every example at the start and see how often they agree. If they disagree more than 20% of the time, your rubric is too fuzzy.
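
Checking rater agreement takes a few lines. The sketch below computes the raw agreement rate and applies the 20% rule of thumb from above. It also computes Cohen's kappa, which isn't mentioned above but is a standard chance-corrected companion to raw agreement. The labels are invented for the example.

```python
from collections import Counter

def percent_agreement(rater_1, rater_2):
    """Fraction of examples where both raters gave the same label."""
    return sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

def cohens_kappa(rater_1, rater_2):
    """Agreement corrected for how often the raters would agree by chance."""
    n = len(rater_1)
    observed = percent_agreement(rater_1, rater_2)
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)

# Hypothetical grades from two raters on the same ten outputs.
rater_1 = ["good", "good", "bad", "good", "bad", "good", "bad", "good", "good", "bad"]
rater_2 = ["bad", "good", "bad", "bad", "bad", "good", "bad", "good", "good", "good"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
rubric_too_fuzzy = (1.0 - agreement) > 0.20

print(f"agreement={agreement:.2f}, kappa={kappa:.2f}, rubric too fuzzy: {rubric_too_fuzzy}")
```

Here the raters disagree on three of ten examples, which is over the 20% line, so the rubric needs another pass before anyone grades the full set.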

Human eval is the most trustworthy signal you can get. It's also the bottleneck in almost every serious LLM team. Which is why people started trying to skip the humans, leading us to...


Method 4 · LLM as judge

Here's a move that feels like cheating: use an LLM to grade another LLM's outputs. Write a careful prompt asking a strong model to compare two answers and pick the better one. Pairwise comparisons again.

This is called LLM-as-judge and it's now standard practice, because it's fast, cheap, and reasonably correlated with human judgement on many tasks.

The catches are real. LLM judges prefer longer answers, are biased by the order in which candidates are shown, tend to favor answers from models in their own family, and are bad at grading things the judge itself is bad at. Most of these can be mitigated: swap the order and average, cap answer length, use a stronger judge than the models being judged, validate against humans on a subset.
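
The "swap the order and average" trick is worth seeing in code. In this sketch the judge is a plain function with a hypothetical signature, `(prompt, first_answer, second_answer) -> "first" or "second"`, standing in for a real call to a judge model. Two toy judges are included purely to show the behaviour.

```python
from typing import Callable

# Assumed interface for illustration; a real judge would wrap an LLM call.
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the order swapped; only trust consistent verdicts."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # Verdict flipped with the order: counts as half a win each.

# Toy judge with pure position bias: always picks whatever it sees first.
def always_first_judge(prompt: str, first: str, second: str) -> str:
    return "first"

# Toy judge with length bias: always picks the longer answer.
def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

question = "What is the capital of France?"
short, long = "Paris.", "The capital of France is Paris."

position_biased = debiased_verdict(always_first_judge, question, short, long)
length_biased = debiased_verdict(longer_answer_judge, question, short, long)
print(position_biased, length_biased)
```

Swapping neutralizes the position-biased judge (its verdict becomes a tie), but the length-biased judge still confidently picks the longer answer both times. Each bias needs its own mitigation.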

The working pattern: run LLM-as-judge on everything continuously, and have humans rate a random sample to make sure the judge hasn't drifted.
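
A sketch of that audit loop, with everything in it assumed for illustration: the verdict records, the sample size of 20, and the 0.8 agreement threshold are placeholders, and the "human" labels are simulated so the example runs on its own.

```python
import random

def sample_for_audit(judged_items, sample_size, seed=0):
    """Pick a reproducible random sample of judged items for human review."""
    rng = random.Random(seed)
    return rng.sample(judged_items, min(sample_size, len(judged_items)))

def agreement_rate(audited_items):
    """Fraction of audited items where the human agreed with the LLM judge."""
    return sum(item["judge"] == item["human"] for item in audited_items) / len(audited_items)

# Hypothetical judge verdicts on 200 pairwise comparisons.
judged = [{"id": i, "judge": "A" if i % 2 == 0 else "B"} for i in range(200)]
audit_batch = sample_for_audit(judged, sample_size=20)

# Simulated human verdicts: humans overrule the judge on ids divisible by 10.
for item in audit_batch:
    flipped = "B" if item["judge"] == "A" else "A"
    item["human"] = item["judge"] if item["id"] % 10 else flipped

MIN_AGREEMENT = 0.8  # Placeholder threshold; pick one that fits your task.
rate = agreement_rate(audit_batch)
judge_has_drifted = rate < MIN_AGREEMENT
print(f"judge/human agreement: {rate:.0%}, drifted: {judge_has_drifted}")
```

The fixed seed is there so the same sample can be re-drawn later. For ongoing monitoring you'd want a fresh seed per audit so the humans don't keep grading the same items.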


Method 5 · vibes

This one sounds like a joke and I mean it with a straight face. The single most important form of LLM evaluation in most teams is "a senior person plays with it for twenty minutes and says yes or no."

This is called "vibes-based evaluation" by its defenders and "unscientific" by its critics. Both are right. What vibes testing measures is something real: does this model, in the hands of a careful tester with actual context about what the product needs, feel like it's doing the job? And it catches things that all four of the above methods miss — subtle tone problems, weird edge cases, annoying tics, things you can only notice by actually using the product for a while.

The right approach is vibes testing plus the four formal methods, not either alone. Vibes on its own is how teams ship models with regressions nobody caught. Formal eval on its own is how teams ship models that score great and feel awful. A senior engineer playing with the system for twenty minutes catches weird failure modes that never show up in any golden set because nobody thought to include them. That's a superpower, not a bug.

If you're building an LLM product, put vibes testing on the calendar. For every candidate. Every time.


Putting it together: what serious teams actually do

Here's the evaluation pipeline most serious LLM teams converge on.

Level 1 · automated regression tests. Run every change against the golden set, scored by LLM-as-judge. Flag regressions bigger than some threshold. Run in CI on every pull request.
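
As a rough sketch of what that CI gate could look like: compare per-example pass/fail results for the baseline and the candidate, list the examples that newly fail, and return a non-zero exit code if the overall pass rate drops by more than a threshold. The result dictionaries and the 2-point threshold are assumptions for the example; the pass/fail values would come from whatever scorer you use.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return ids of examples that passed on the baseline but fail on the candidate."""
    return sorted(
        ex_id for ex_id, passed in baseline.items()
        if passed and not candidate.get(ex_id, False)
    )

def gate(baseline: dict[str, bool], candidate: dict[str, bool], max_drop: float = 0.02) -> int:
    """Return a process exit code: 1 if the pass rate dropped too far, else 0."""
    baseline_rate = sum(baseline.values()) / len(baseline)
    candidate_rate = sum(candidate.values()) / len(candidate)
    regressed = find_regressions(baseline, candidate)
    print(f"pass rate {baseline_rate:.0%} -> {candidate_rate:.0%}, newly failing: {regressed}")
    return 1 if baseline_rate - candidate_rate > max_drop else 0

# Hypothetical golden-set results (True = judged acceptable).
baseline_results = {"ex-001": True, "ex-002": True, "ex-003": False, "ex-004": True}
candidate_results = {"ex-001": True, "ex-002": False, "ex-003": True, "ex-004": False}

exit_code = gate(baseline_results, candidate_results)
# In a CI job, this value would be passed to sys.exit() so the build fails.
```

Reporting the newly failing ids matters as much as the aggregate number: a change can fix one example and break two others, and the average alone hides which ones.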

Level 2 · benchmarks. Run a handful of public benchmarks for context and sanity checks. Don't trust them for your specific task.

Level 3 · structured human eval on a sampled subset. Periodically have humans grade a representative sample pairwise against the previous baseline. Keeps the LLM judge honest.

Level 4 · vibes. Before shipping, have at least one senior engineer use the product for twenty minutes with real inputs, looking for "feels wrong" as much as "is wrong."

Level 5 · production telemetry. Once shipped, watch what users do. Thumbs up and down, retry rates, complaints. The ultimate ground truth, slow to arrive.

None of these levels is sufficient alone. Together they catch most issues. That's about as good as the current state of the art gets.


Why this post is mostly bad news

I'll be honest. If there's a thesis to this post, it's that LLM evaluation is nowhere near solved, and the methods we have are all crude. Benchmarks leak. Golden sets are small. Humans are slow. LLM judges are biased. Vibes are subjective. Every metric is a compromise.

The main consolation is that doing evaluation badly is still dramatically better than doing no evaluation. The worst evaluation strategy is "I tried it a few times and it felt fine" with no record, no comparison, and no way to notice regressions. A team that runs a 100-example golden set with LLM-as-judge on every change is not running great evaluation by any academic standard, but it's running structured evaluation, and that alone puts them ahead of most of the field.

If you take one practical thing out of this post, take this: build a small golden set of real-world inputs today. Save it. Run it on every change. Even if your scoring is terrible, having anything that catches regressions is worth a lot. You can improve the scoring later. You can't improve what you don't measure.


What just changed in your head

You started this post thinking evaluation was a technical step you'd handle later. You're ending it knowing that it's probably the hardest thing in the entire LLM workflow, that the working tools are crude compromises, and that most teams are winging it with a mix of benchmarks, golden sets, LLM judges, and careful vibes.

One sentence to carry:

There is no good automatic way to measure "did the LLM give a good answer." Serious teams combine several bad methods, treat the combination as a noisy signal, and let a senior human rate at least a sample by hand.

In the final post of this module, we close out by going through the specific failure modes every LLM user should be aware of. Hallucination (again, but deeper). Sycophancy. Jailbreaks. Bias. Each one is a characteristic way LLMs go wrong, each one has partial mitigations, and each one is worth knowing about before you discover it the hard way.


Course navigation

Previous: M5.4 · Finetune vs Prompt vs Retrieve
You are here: M5.5
Next: M5.6 · The Failure Modes of LLMs

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.


Cover photo via Unsplash. This post is part of the AI Zero to Hero series.
