Fine-Tuning in 2026: What It Is Still Good For
Most teams that think they need fine-tuning need a better prompt or better retrieval. Here are the three narrow cases where fine-tuning earns its cost, and how much data you actually need.
Every team I've worked with has asked this question at some point: "Should we fine-tune our own model?" The question almost always comes wrapped in optimism — a PM or engineer has read an article about fine-tuning, talked to a vendor, seen a benchmark, and become convinced it'll unlock a capability the base model doesn't have.
Nine times out of ten, the answer is no. Not because fine-tuning is bad — it's a real technique with real wins — but because the problem the team is actually trying to solve is either a prompt problem (fix with B2.2), a retrieval problem (fix with B3), or a capability problem where fine-tuning won't help anyway. The tenth case, where fine-tuning is the right tool, is narrower than the hype implies, and when you hit it you can recognise it from a specific shape.
This post is the honest 2026 take. What fine-tuning is for, what it isn't for, how much data you actually need, and the framework to decide before you spend a month on it.
What fine-tuning actually does
Fine-tuning is the process of taking an existing model and further training it on a new dataset — usually a small one (hundreds to thousands of examples) — to nudge its behaviour on a specific task. You're not teaching it new world knowledge. You're not giving it new capabilities. You're changing what it prefers to do when it sees certain inputs.
The mental model to hold: fine-tuning shifts the model's probability distribution toward your dataset's shape. It's a bias adjustment, not a brain transplant.
This is the single most load-bearing point of this post. If you remember nothing else, remember: fine-tuning is about behaviour, not knowledge. The model you end up with knows roughly the same things it knew before you started. It has just been nudged to behave more like the examples you showed it.
In modern practice, most fine-tuning is done as LoRA (Low-Rank Adaptation) or similar parameter-efficient techniques that train a small set of adapter weights rather than the whole model. This makes training cheap, fast, and reversible. A LoRA adapter for a 70B model can be trained on a few thousand examples in an hour on a single GPU and is a few hundred megabytes on disk. That is the reality of modern fine-tuning — a small file that you can deploy, swap, or delete.
The three things fine-tuning is actually good for
Three narrow wins, in rough order of how often they apply:
1 · Enforcing format, voice, or style
The most common legitimate win. You want the model to consistently produce outputs in a specific format, tone, or house style that cannot be reliably achieved through prompting alone. Examples:
- Generate code in your company's exact internal style conventions.
- Produce support responses in your brand's particular voice across languages.
- Output a specific structured format that's too complex for a simple schema and too subtle for prompt instructions alone.
- Match the summarisation style of a specific editor's past work.
Before fine-tuning for format, try prompting harder and try few-shot (see B2.3). If after a serious attempt at both, the model still drifts, fine-tuning on 200-500 high-quality examples will lock the behaviour in. The signal the model needs to learn is small and consistent; LoRA picks it up fast.
2 · Teaching a narrow, repeatable classification or extraction task
If you have a task that reduces to "given this kind of input, produce exactly this kind of output" and you have a few hundred labelled examples, fine-tuning will usually beat prompting on both quality and cost. The fine-tuned model is smaller (or cheaper per call if you use a small base), more accurate on the specific task, and often faster.
Examples:
- Classify legal documents into categories your business cares about.
- Extract specific fields from technical specs in a domain vocabulary.
- Rewrite customer messages into a standardised support-ticket schema.
- Triage incoming bug reports into your issue categories.
The pattern is always the same: narrow, repeatable, well-defined input and output. If you can describe the task as "if you give me X, give me back Y, consistently," fine-tuning is a candidate.
3 · Compressing a big-model prompt into a small-model behaviour
Here's the clever move: you use a big expensive model to generate correct outputs for a large number of inputs, then use those pairs to fine-tune a small cheap model. The small model learns to imitate the big one's behaviour on your specific task distribution and you ship the small one.
This pattern — sometimes called "distillation" in the fine-tuning context — is how teams get 10x cost savings on hot-path calls without quality loss. You need:
- A hot-path task that's called a lot.
- An existing prompt-engineered solution on a big model that works well.
- The budget to generate 1,000-10,000 labelled examples from the big model.
- A small base model to fine-tune (7B-13B is the sweet spot).
Do it right and you can replace 20,000 calls/day to a frontier model with 20,000 calls/day to a LoRA-adapted 8B model, often on your own hardware. The cost delta is enormous; the quality delta on the narrow task is often imperceptible.
What fine-tuning is NOT for
Four things people try to fine-tune for where it doesn't work. If you're about to fine-tune for any of these, stop.
1 · Teaching the model new facts
You want the model to know the details of your product's 3,000 SKUs. You fine-tune it on a dataset of SKU descriptions. The model... sort of learns them. Sometimes. With errors. Less reliably than if you'd just retrieved the SKU description at query time and put it in the prompt.
This is a RAG problem, not a fine-tuning problem. Fine-tuning on facts produces unreliable memorisation. The model will sometimes remember, sometimes hallucinate, and you have no way to tell which. Retrieval (Module B3) is the right answer for facts, and it's dramatically easier to update and audit.
Rule: every time you're tempted to fine-tune for facts, use retrieval instead. Every. Time.
2 · Adding reasoning capability
You want the model to be better at hard reasoning. You fine-tune it on chain-of-thought examples. You see a small improvement on your eval set. You ship. Three months later, the improvement has faded and the model is drifting on other tasks.
Fine-tuning does not meaningfully add reasoning capability to a base model. You can nudge existing capability to be more visible, but you can't create capability that isn't there. For reasoning, you either need a reasoning model (Claude extended thinking, GPT-o-series, etc.) or a smarter base model.
3 · Replacing a better prompt
The most common waste. A team's prompt is sloppy — 200 tokens, no structure, no examples, no clear task definition. They can't get the model to behave. They fine-tune instead of fixing the prompt. The fine-tuning works a little, but a better prompt would have worked more, for zero cost and zero ops overhead.
Rule: spend a day rewriting the prompt (per B2.2) before you spend a month fine-tuning. Most "fine-tuning is necessary" cases dissolve under a rewritten prompt.
4 · Multi-task generalist replacement
You want to replace your frontier model with a fine-tuned small model across your whole product. You fine-tune on a mix of tasks. The small model gets worse at everything uniformly. This is because fine-tuning on a mix of tasks often results in the small model being unable to do any of them well — a small capacity can't hold many distinct behaviours at once.
Fine-tune for one task at a time. If you have five tasks, you need five LoRA adapters, swapped at runtime based on the request. Not one generalist tuned model.
How much data you actually need
A common misconception: fine-tuning needs thousands of examples. The modern reality is much more forgiving:
- 50-200 examples is enough for simple format or voice shaping on a well-tuned base model.
- 500-2,000 examples is enough for a narrow classification or extraction task.
- 2,000-10,000 examples is enough for a distillation job compressing a big-model prompt into a small-model behaviour.
- Beyond 10,000 you're usually in diminishing-returns territory unless the task is very hard or very broad.
The other question is quality. 200 high-quality examples beat 2,000 noisy ones. This is where most teams fail: they rush the dataset, accept mediocre labels, and then wonder why the fine-tuned model is mediocre.
Rules for the dataset:
- Every example should be something you'd be proud to ship as output. If it isn't, don't include it.
- Cover the edge cases you actually care about. If your fine-tune dataset never includes the tricky cases, the model won't learn them.
- Use diverse inputs. If every example is a variant of the same phrasing, the model memorises the phrasing and fails on new phrasings.
- Hold out a test set, not just a training set, so you can measure quality after training.
- Label with intent, not just pattern. Don't include examples where the correct output is ambiguous or debatable.
Budget: building a good 500-example dataset typically takes one or two engineer-weeks including labelling, review, and deduplication. It is not a five-minute job.
The fine-tuning decision framework
If you answer the four questions honestly and you end up at "Fine-tuning is a candidate," the next step is to build a small dataset (100-200 examples), run a LoRA fine-tune on an open model or a managed fine-tuning service, measure against your eval set, and decide. If the numbers lift and they stay lifted after a week of real traffic, you have a win. If not, you learned that fine-tuning wasn't the right tool for this problem — which is a cheap lesson if you limited yourself to a small dataset and a single attempt.
Managed vs self-hosted fine-tuning
If you've decided to fine-tune, the second decision is where to run it. Both managed APIs and self-hosted fine-tuning are viable in 2026.
Managed fine-tuning (OpenAI's fine-tuning API, Anthropic's fine-tuning, Google Vertex, various hosting providers): you upload your dataset, they train, you get back a model you can call via their API just like any other model. Cost is modest for small datasets. You don't own the weights; you can't run it yourself. Convenient for the first experiment.
Self-hosted fine-tuning: you run LoRA training yourself on your own hardware or a rented GPU. Frameworks like axolotl, unsloth, torchtune, or direct use of Hugging Face peft handle the mechanics. You own the weights. You can deploy them to local inference (per B6.1) or to a hosting service. Takes a weekend of setup for a team new to it.
Rule: start with managed for the first fine-tune to validate the approach without spending time on infrastructure. Move to self-hosted only when you want to train many variants, need to deploy locally, or the managed costs become significant.
Admit what breaks
- Fine-tuning degrades general capability. Fine-tune on narrow data and the model gets worse on everything outside that narrow distribution. Measure on a broad eval set, not just your task's eval set.
- Fine-tuning is a moving target. Base models update. Your fine-tune is tied to a specific base version. When the base deprecates, you have to re-tune. Budget for this.
- Datasets rot. Your 500 examples from six months ago may no longer match what users ask. Refresh the dataset when the eval drifts.
- Managed fine-tuning can be expensive at iteration speed. Each experiment is a training job; each job is money. Keep your experiment loop tight.
- LoRA adapters are small but they are production artifacts. Version them, review them in PRs, store them like code. We shouldn't have to say this but most teams don't.
- Fine-tuning bakes in biases from the training data. Whatever implicit assumptions your 500 examples carry — tone, gender, format, cultural context — will be amplified in the fine-tuned model. Review your dataset for bias with the same rigor you'd apply to any training data.
- "I fine-tuned and it got better" can be noise. A 5% improvement on a 100-case eval is within statistical noise. Require real signal before you claim the fine-tune worked.
What just changed in your code
- Default to not fine-tuning. It is the right answer for ~90% of "should we fine-tune" questions.
- Before fine-tuning, run the decision framework — better prompting first, then RAG for knowledge problems, then bigger or reasoning models for capability problems.
- If you fine-tune, make it one narrow task at a time. One LoRA per behaviour.
- Budget one to two engineer-weeks for the dataset. Quality over quantity: 200 good examples beat 2,000 mediocre ones.
- Start with a managed fine-tuning API for your first experiment. Move to self-hosted when you have a repeat use case.
- Always measure on a held-out test set, not on the training data.
- Re-evaluate the fine-tune every model deprecation cycle. It's a living artifact, not a one-time thing.
Next post, B6.3, we get into the most overlooked frontier feature of the year: multimodal, in practice. Not the demo hype. The specific thing that image-in-structured-data-out unlocks for real products, and why it's the most underrated building block of 2026.
Course navigation
| ⬅️ Previous | 📍 You are here | Next ➡️ |
| ⬅️ Previous B6.1 · Local Models — When llama.cpp Wins | B6.2 of B6.4 | Next ➡️ B6.3 · Multimodal in Practice |
📚 AI for Builders · Course Home — 28 posts, six modules.
Cover photo via Unsplash. This post is part of the AI for Builders series.