The Capability Frontier: Scoping for What Models Will Do in Six Months

Scope against what the model will be able to do in six months, not what it did yesterday. Here is the framework for betting on capability trends without wagering on vapour.


In the time it took your last AI feature to go from spec to launch (let's say four months), frontier models moved forward at least once, probably twice. Claude Sonnet picked up a new capability. GPT-5's reasoning mode got faster and cheaper. Gemini's long-context work crossed a threshold. By the time your feature shipped, the model underneath it was no longer the model your roadmap assumed when you started.

You have two choices about this. You can pretend it's not happening and scope to whatever the model could do the day you wrote the PRD — safe, predictable, and guaranteed to ship something obsolete. Or you can scope to where the model will be when you launch, which is riskier but is the only way to build products that feel fresh instead of dated.

The skill is the second one. Scope for the model you will have at launch, not the one you have today. That is not the same as "bet on features that don't exist yet." It is a disciplined way of reading where the frontier is heading and letting those trends shape what you commit to building. Done well, it makes your product feel ahead of the curve at launch. Done badly, it builds you a feature that depends on a capability that never shipped.

This post is the framework, built for the shape of the frontier in April 2026. The specific capabilities will age. The framework won't.


What the frontier actually is

Forget for a moment the marketing-friendly phrase "the frontier" and look at what it describes concretely. At any moment in time, there are three bands of model capability you can reason about separately:

Band 1: reliably shipped. Capabilities that every major frontier model has, that your app can depend on, and that your users will see consistently today. As of April 2026 this includes: long-context reasoning up to ~200K tokens, tool use and function calling, structured output via schema enforcement, multimodal input (images), streaming output, multilingual generation in 20+ major languages. You can spec against any of these with high confidence — they'll still be working a year from now, probably better.

Band 2: shipped but uneven. Capabilities that exist, that some models handle well, but that aren't consistent across providers or workloads. As of April 2026 this includes: computer use / browser automation agents, reasoning modes for hard multi-step tasks, video understanding, cross-session persistent memory, large-scale structured extraction from documents. You can build products on these (Cursor's background-agent work lives here, for instance), but you have to pick your provider carefully, test against a real eval set, and design around the rough edges. This is where the biggest product wins of 2026 are being made.

Band 3: on the near horizon. Capabilities that frontier labs are actively working on, that appear in research demos, but aren't shippable today. As of April 2026 this includes: robust many-hour autonomous agent work, real-time voice with emotion and interruption handling at production quality, near-perfect factual grounding without retrieval, and small local models that match frontier quality on narrow tasks. You should track these but not scope a launch against them.

Three bands, three different ways of betting on them. The whole framework comes down to this: spec the main product on Band 1, pick one or two deliberate wins from Band 2, and never spec against Band 3.

Let's take each rule in order.


Rule 1: spec the main product on Band 1

The core value proposition of your AI product should work on capabilities that are rock-solid today. These are the capabilities where a user's success rate on launch day is indistinguishable from their success rate a year later. If your whole product depends on something in Band 2 or 3, you are building on shifting ground.

Concrete examples of Band-1 products shipping well in 2026:

  • A legal research assistant that reads long documents (200K context, reliably shipped).
  • A support bot that uses tools to query your database and return structured answers (tool use, schemas, streaming — all Band 1).
  • A photo-to-structured-data product like the receipt extractor from Course 2's B6.3 (multimodal input, structured output — both Band 1).
  • A multilingual chatbot serving users in 30 languages (multilingual generation, Band 1).

None of these are boring. All of them are shipping against capabilities that were Band 2 or 3 eighteen months ago and are Band 1 now. The right move is to be early to a Band-1 capability the moment it stabilises, not to chase Band-3 capabilities that may never ship.

The failure mode: picking a Band-1 capability, then adding a dependency on a Band-2 capability as your "differentiator" without realising you've moved your whole risk posture into Band 2. Watch for this in your own specs. If removing the "differentiator" kills the feature, the whole feature is in Band 2 now, not just that one piece.
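The "remove the differentiator" test can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the `Dependency` class, the `load_bearing` flag, and the example support bot are all hypothetical names introduced here. The idea is that a feature's real band is the riskiest band among the dependencies it cannot live without.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    band: int           # 1, 2 or 3, as defined in the three bands above
    load_bearing: bool  # True if removing this capability kills the feature


def effective_band(dependencies: list[Dependency]) -> int:
    """Return the riskiest band among the load-bearing dependencies."""
    bands = [d.band for d in dependencies if d.load_bearing]
    # Assumption for this sketch: a feature with no load-bearing model
    # dependency carries no band risk, so it is treated as Band 1.
    return max(bands, default=1)


# Hypothetical spec: a Band-1 support bot whose "differentiator" is load-bearing.
support_bot = [
    Dependency("tool use", band=1, load_bearing=True),
    Dependency("structured output", band=1, load_bearing=True),
    Dependency("cross-session memory", band=2, load_bearing=True),
]

print(effective_band(support_bot))  # 2
```

If the memory dependency were marked `load_bearing=False` (the feature still works without it), the same spec would come back as Band 1.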


Rule 2: pick one or two deliberate wins from Band 2

Band 1 alone is safe but often not differentiated. Everyone can spec against Band 1. The products that stand out in 2026 are pulling one or two threads from Band 2 — taking the risk that a capability will stabilise in the launch window and getting there first.

This is where the real product judgement lives. Pick too many Band-2 dependencies and your launch becomes "hope the models keep working." Pick zero and you ship something that looks like 2024. Pick one or two specifically, with eyes open and a measured fallback plan, and you ship something that feels fresh.

A concrete 2026 example: Cursor and similar code environments are betting hard on reasoning-model code review and background agents that work on tasks for minutes at a time. These are Band 2 today. When those capabilities stabilise to Band 1 — probably this year — the products that spent the last six months building around them will look prescient. The products that waited will play catch-up.

How to pick which Band-2 capability to bet on, with specifics:

  • Pick the one where the direction is clearer than the current quality. Reasoning models will get better, longer, and cheaper. Long-running agents will get more reliable. These are trends you can bet on with reasonable confidence because the labs have been publicly improving them every quarter for a year. You can see the gradient.
  • Avoid capabilities whose direction is ambiguous. "The next frontier model will be much smaller and run locally at frontier quality" is a plausible claim. It has been a plausible claim for three years. Don't bet a product on it.
  • Confirm with actual benchmark data, not with tweets. If you're scoping against a capability because a breathless announcement landed last week, wait two weeks and see if serious benchmark runs confirm the claim. Most don't.
  • Have a fallback story for every Band-2 bet. "If reasoning-model quality stalls, we fall back to chain-of-thought prompting on a non-reasoning model." Write the fallback down in the PRD. If you can't articulate it, the bet is too risky for the feature.

The calibration matters. In my experience, one Band-2 bet per major feature is the sweet spot. Two is aggressive but often manageable. Three means your feature is really a Band-2 product and you should re-read Rule 1.
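For teams that like a lint-style check on a spec, the calibration above fits in one small function. This is a sketch only; `band2_verdict` and its return strings are names made up for this example, and the thresholds are simply the rule of thumb from the paragraph above.

```python
def band2_verdict(bands: list[int]) -> str:
    """Map the bands of one feature's dependencies to the calibration above."""
    bets = sum(1 for band in bands if band == 2)
    if bets == 0:
        return "safe, but probably undifferentiated"
    if bets == 1:
        return "sweet spot"
    if bets == 2:
        return "aggressive but often manageable"
    return "really a Band-2 product: re-read Rule 1"


# Three Band-1 dependencies plus one deliberate Band-2 bet.
print(band2_verdict([1, 1, 1, 2]))  # sweet spot
```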


Rule 3: never spec against Band 3

Band-3 capabilities are research results and demos. They are not products. The single most reliable way to set a roadmap on fire is to commit to a Band-3 capability in a spec and hope it lands before launch.

This is harder to resist than it sounds. Band-3 capabilities are exciting. They're what the labs publish. They're what Twitter gets animated about. They're what your CEO saw at a keynote before asking why your team isn't building with them. The temptation to bake them into a spec is real, especially for PMs who are trying to match the hype cycle.

Resist by distinguishing "track" from "scope." Tracking means: you read the paper, you note the capability, you keep an eye on whether it makes it into a shipping model. Scoping means: you commit to a roadmap that depends on it. These are very different activities. You can and should track widely. You should scope narrowly.

One more specific defence: a six-month-minimum rule. A Band-3 capability has to have been publicly demonstrated working reliably for at least six months before it moves into Band 2 and becomes eligible for scoping. Most capabilities that sound revolutionary in a keynote are still broken six months later. The ones that survive six months of real use are the ones that shipped for real. Use that filter.
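The six-month filter is easy to apply mechanically once you have written down a date. A minimal sketch, assuming you record a `reliable_since` date for each tracked capability (the date from which public use has looked reliable; that is still a judgement call, and both function names are hypothetical):

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month has not fully elapsed yet
    return months


def eligible_for_scoping(reliable_since: date, today: date,
                         minimum_months: int = 6) -> bool:
    """True once a capability has had the minimum run of public reliability."""
    return months_between(reliable_since, today) >= minimum_months


print(eligible_for_scoping(date(2025, 10, 1), date(2026, 4, 1)))   # True
print(eligible_for_scoping(date(2025, 12, 10), date(2026, 4, 1)))  # False
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the check reproducible when you revisit an old PRD.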


A 2026 worked example

Let me apply the framework to a real-feeling case to show the full flow.

The project: a product called "Atlas" — an AI research assistant for a biotech team. It reads a corpus of scientific papers, answers questions, and produces structured summaries for grant applications.

What Atlas needs to do, mapped to bands as of April 2026:

| Capability | Band | Notes |
|---|---|---|
| Long-context PDF reading | Band 1 | 200K context is reliably shipped; Claude and Gemini handle 500-page papers well |
| Multilingual (English + German + Japanese papers) | Band 1 | Multilingual generation is solid |
| Extract structured grant-application fields | Band 1 | Schema-enforced structured output, rock solid |
| Cite specific sentences in source papers | Band 2 | Citation grounding is shipped but uneven: best on reasoning models, weaker elsewhere |
| Generate a written draft grant section in the team's voice | Band 2 | Voice matching at this quality is shipped but inconsistent; needs either few-shot or lightweight fine-tuning |
| Autonomous multi-hour "read all 40 papers on this topic and produce a literature review" | Band 3 | Long-running autonomous agents of this length are not production-ready |
| Real-time voice Q&A over the corpus while the researcher talks | Band 3 | Voice-native models exist but not with sub-second latency on top of retrieval |

The scoping call:

  • Main product: Band 1 rows. Long-context reading + multilingual + structured extraction. This is the core, and it ships with high confidence.
  • Two deliberate Band-2 bets: citation grounding and voice-matching for drafts. These are the "feels fresh" differentiators. Each has a fallback (fall back to paragraph-level citation; fall back to template-based drafts).
  • Do not spec: the autonomous literature review or the voice Q&A. These go on the "track" list. If they move into Band 2 by launch, great — a fast follow-up. If not, they were never promised.
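The scoping call above is mechanical once every row has a band, which is rather the point of the framework. Here is a rough sketch of that sorting step using the Atlas rows. The tuple layout and the `scoping_call` function are illustrative assumptions; the bands and fallbacks are the ones stated above.

```python
# (capability, band, fallback) rows taken from the Atlas mapping above.
ATLAS = [
    ("Long-context PDF reading", 1, None),
    ("Multilingual papers (English, German, Japanese)", 1, None),
    ("Structured grant-application fields", 1, None),
    ("Sentence-level citations", 2, "paragraph-level citation"),
    ("Draft section in the team's voice", 2, "template-based drafts"),
    ("Autonomous multi-hour literature review", 3, None),
    ("Real-time voice Q&A over the corpus", 3, None),
]


def scoping_call(rows):
    """Sort capabilities into main product, deliberate bets, and track list."""
    plan = {"main": [], "bets": [], "track": []}
    for name, band, fallback in rows:
        if band == 1:
            plan["main"].append(name)
        elif band == 2:
            plan["bets"].append((name, fallback))  # each bet carries its fallback
        else:
            plan["track"].append(name)  # tracked, never promised
    return plan


plan = scoping_call(ATLAS)
for bucket, items in plan.items():
    print(bucket, len(items))
```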

The launch post for Atlas, in April 2026, would read: "Reads any paper in any major language, extracts the fields your grant needs, and drafts a section in your team's voice with citations you can click through to the source." All Band 1 except the two deliberate Band-2 bets, both of which have fallbacks.

Six months later, if long-running autonomous agents move to Band 2, Atlas ships a "literature review" follow-up. If they don't, Atlas keeps winning on the core it already built. The launch is not dependent on Band-3 capabilities, so nothing breaks.


How to read the frontier without drowning

A final practical question: where do you actually get the data to decide what's in which band? The short answer is the same four signal types from Course 2's B6.4 (frontier model releases, price drops, new capability types, security/behaviour changes), plus a small number of honest sources. For product people, the list is:

  • Provider changelogs from Anthropic, OpenAI, Google. Skim weekly. They tell you what just moved from Band 2 to Band 1.
  • Anthropic's and OpenAI's cookbooks / example repos. These are more honest about what's actually shippable than marketing posts.
  • Benchmark leaderboards — but with a grain of salt. LMArena is useful for calibration. Individual benchmarks are often cherry-picked; trust composites and real-world eval sets.
  • Honest practitioner newsletters. Simon Willison's blog, Ethan Mollick's One Useful Thing, Jack Clark's Import AI. Two of these plus the changelogs is enough.
  • Your own quarterly sweep. Run your eval set against three models, record the scores. Over time you build your own private benchmark of which bands each capability is in for your workload. This is more valuable than any public benchmark.
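The quarterly sweep in the last bullet can be turned into band placements with a simple rule: if every model clears a high bar, the capability behaves like Band 1 for your workload; if only some do, it is Band 2; otherwise keep it on the track list. The sketch below assumes eval scores between 0 and 1. The two thresholds, the model names, and the scores are placeholders invented for illustration, not measurements, so tune them against your own eval set.

```python
def place_band(scores: dict[str, float], solid: float = 0.95,
               usable: float = 0.80) -> int:
    """Place one capability in a band from per-model eval scores (0 to 1)."""
    if not scores:
        raise ValueError("need at least one model score")
    if min(scores.values()) >= solid:
        return 1  # every model handles it: reliably shipped
    if max(scores.values()) >= usable:
        return 2  # some models handle it well: shipped but uneven
    return 3      # no model is usable yet: track, don't scope


# Made-up scores for one quarterly sweep across three placeholder models.
sweep = {
    "structured extraction": {"model_a": 0.98, "model_b": 0.97, "model_c": 0.96},
    "citation grounding": {"model_a": 0.89, "model_b": 0.74, "model_c": 0.61},
    "multi-hour agent run": {"model_a": 0.35, "model_b": 0.20, "model_c": 0.28},
}

placements = {capability: place_band(s) for capability, s in sweep.items()}
print(placements)
```

Keeping each quarter's `placements` alongside the raw scores is what builds the private benchmark over time: a capability moving from 3 to 2 to 1 across sweeps is the gradient you are looking for.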

What to deliberately ignore: AI Twitter as a daily habit, keynote demos, and any post with "this changes everything" in the title. All of them are high noise and low signal for product planning.


The failure mode: "frontier vapour commitment"

One specific anti-pattern worth naming. A PM reads a lab's keynote, sees an exciting capability demo, and writes a feature into the roadmap that depends on it. The demo was cherry-picked; the real capability is two years away; the PM has committed the team to something that will not exist when they need it. Six months into the build, engineering starts reporting that "the capability we depended on isn't actually reliable." The feature slips. The roadmap unravels.

The cost of this pattern is high: three to six months of engineering spent on a dependency that never arrives, a broken launch promise to leadership, and a team that stops trusting the PM's judgement on AI capabilities. It is the single most common way I see AI roadmaps get destroyed, and it is entirely preventable with the band framework.

Two concrete defences:

  1. Every capability your roadmap depends on has to have a line in the PRD naming its current band and the specific signals you used to place it there. "We are betting on citation grounding at production quality. It is in Band 2 as of April 2026 based on the Anthropic cookbook, our own eval run at 89%, and six months of public demos working reliably." If you can't write that line, you're in vapour territory.
  2. Every Band-2 dependency has to have an explicit fallback. "If citation grounding regresses, we ship with paragraph-level citations using our own retrieval layer." If you can't articulate the fallback, the dependency is too risky for the feature.
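If your PRDs live somewhere structured, both defences can be checked automatically. A minimal sketch, assuming each dependency is recorded with its band, the signals behind that placement, and a fallback; the `PrdDependency` class and `vapour_check` function are hypothetical names, and the example entry reuses the citation-grounding line from above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PrdDependency:
    capability: str
    band: int
    as_of: str
    signals: list[str] = field(default_factory=list)
    fallback: Optional[str] = None


def vapour_check(dep: PrdDependency) -> list[str]:
    """Return the problems with one PRD dependency; an empty list means it passes."""
    problems = []
    if dep.band not in (1, 2, 3):
        problems.append("band must be 1, 2 or 3")
    if not dep.signals:
        problems.append("no signals recorded for the band placement")
    if dep.band == 2 and not dep.fallback:
        problems.append("Band-2 dependency has no fallback")
    if dep.band == 3:
        problems.append("Band-3 capability: track it, do not scope it")
    return problems


citation_grounding = PrdDependency(
    capability="citation grounding at production quality",
    band=2,
    as_of="April 2026",
    signals=[
        "Anthropic cookbook",
        "own eval run at 89%",
        "six months of public demos working reliably",
    ],
    fallback="paragraph-level citations using our own retrieval layer",
)

print(vapour_check(citation_grounding))  # []
```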

These two defences are cheap to implement and they catch 90% of the vapour-commitment mistakes before they land in a roadmap.


What just changed in your roadmap

  • Adopt the three-band framework on every AI feature you scope. Band 1 for the main value proposition. One or two Band-2 bets for differentiation. No Band-3 dependencies, ever.
  • Write each capability's current band into the PRD, with the specific signals that placed it there. If you can't write the signals, you don't know the band, and the scope is too speculative.
  • Require a fallback plan for every Band-2 dependency. If the capability regresses or stalls, what does the feature become? No fallback = no scope.
  • Track Band-3 capabilities on a separate list, never on the roadmap. Revisit the list quarterly; promote items to Band 2 only after they've had six months of public reliability.
  • Run your own quarterly model sweep (per Course 2's B6.4). Your eval set is the most honest source of band placement for your workload.
  • Default to reading provider changelogs and two honest newsletters. Skip AI Twitter. Your time is better spent.
  • Re-read this framework at the start of every major AI project. The specific capabilities will shift every few months; the bands will always exist.

And that closes Module P1 — The AI Product Mindset. You now have the AI feature vs AI product distinction, the non-determinism tax, the build vs buy vs wrap decision matrix, and the capability frontier framework. Four posts, four frameworks, zero generic PM advice.

Next up: Module P2 — Discovery and Scoping. Finding AI-shaped problems in your product, the demo-to-prod gap, prototyping as a non-coder, and the one-pager shape that survives first contact with engineering. See you there.


Course navigation

⬅️ Previous: P1.3 · Build vs Buy vs Wrap
📍 You are here: P1.4 of P5.4
Next ➡️: P2.1 · Finding AI-Shaped Problems

📚 AI for Product · Course Home — 20 posts, five modules.


Cover photo via Unsplash. This post is part of the AI for Product series.
