Planning vs Reacting: Which Agent Pattern Wins

Read the agent research literature and you'll meet three patterns in quick succession: ReAct ("reason, then act, then reason again"), plan-and-execute ("write the full plan up front, then execute it step by step"), and reflection ("after acting, critique your own output and try again"). Each has a seminal paper, each has advocates, and each is the secret sauce of someone's framework.

Read production agent codebases and you'll meet something quieter: most of them are just reacting — the B4.2 loop from last post, one-tool-at-a-time, no explicit plan, no reflection step, a while loop around messages.create. And a lot of them work fine.

So when does planning earn its tokens? When does reflection actually help? When should you stick with the simple reactive loop and save your money? This post is the honest field guide. No plotting against a paper, no defending a framework — just what I've seen work and fail on real projects in 2026.

What each pattern actually does

Let me define them briefly before comparing. All three are patterns you layer on top of the agent loop from B4.2. The loop itself is the reactive baseline.

1 · Reactive (ReAct-lite). The loop we built in B4.2. Each iteration, the model looks at the conversation so far and picks the next tool call. No explicit plan. The model's "plan" is implicit in its next choice. This is also — confusingly — what most people call "ReAct" in practice, even though the original ReAct paper is slightly more specific about the reasoning-before-action step.

2 · Plan-and-execute. A two-phase pattern. In phase one, the model writes a full plan: "Step 1: get current time in Sydney. Step 2: compute hours until Tokyo meeting. Step 3: format the answer." In phase two, the plan runs step by step, usually with the model still in the loop to handle each step's tool call. The plan is sometimes revised if a step fails.

3 · Reflection. A post-action pattern. After the agent produces an answer, a second model call (or the same model with a different prompt) critiques it: "Did the agent answer the user's actual question? Did it use the right tools? Are there errors?" If the critique finds problems, the agent re-runs with the critique as additional context.

You can combine them. Plan-and-execute with reflection at the end. ReAct with intermediate reflection after every other step. The combinatorial space is what makes the agent literature feel overwhelming.

Three routes to the same answer. Let's look at each honestly.

Reactive: the baseline that mostly wins

The reactive loop wins for most real agents I see because it has three properties the others lack:

It's cheap. One model call per iteration. No planning call, no reflection call.
It adapts to reality. Every iteration, the model sees the actual outputs of previous tool calls and decides the next action based on them. If step 3 fails, the model sees the error and picks step 3-prime.
Modern instruction-tuned models are already good at sequencing. The "pick the next tool call" problem is something frontier models handle well on their own. They don't need a separate planning phase.
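The B4.2 loop isn't reproduced in this post, so here is a stand-in sketch of the "adapts to reality" property. Everything in it is hypothetical: `fake_model` is a scripted substitute for the real model call so the example runs offline, and the two search tools are invented for illustration. The loop feeds each tool result, including errors, back into the history before asking for the next action.

```python
def primary_search(query: str) -> str:
    # Hypothetical tool that always fails, to show how the loop recovers.
    raise TimeoutError("primary index unavailable")

def backup_search(query: str) -> str:
    return f"backup result for {query!r}"

TOOLS = {"primary_search": primary_search, "backup_search": backup_search}

def fake_model(history: list[dict]) -> dict:
    """Scripted stand-in for the model: picks the next action from what it has seen."""
    last = history[-1]
    if last["role"] == "user":
        return {"tool": "primary_search", "input": last["content"]}
    if last["role"] == "tool" and last["is_error"]:
        # The previous step failed, so choose a different step ("step 3-prime").
        return {"tool": "backup_search", "input": history[0]["content"]}
    return {"final": f"Answer based on: {last['content']}"}

def run_reactive(question: str, max_iterations: int = 5) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        action = fake_model(history)
        if "final" in action:
            return action["final"]
        try:
            output, is_error = TOOLS[action["tool"]](action["input"]), False
        except Exception as exc:
            # Errors are fed back as observations instead of crashing the loop.
            output, is_error = f"{type(exc).__name__}: {exc}", True
        history.append({"role": "tool", "content": output, "is_error": is_error})
    return "Stopped: iteration budget exhausted."
```

There is no plan object anywhere. The recovery comes entirely from the failed tool output being visible on the next iteration.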

Reactive loses in two specific cases:

Very long multi-step tasks where the model has to hold a lot of state in its head and loses track of what it was doing. "Research 10 competitors, compile a table, write a summary, and email it to the PM" benefits from having the plan pinned to the top of the context so the model doesn't drift.
Tasks with expensive irreversible steps where re-running after a mistake is costly. If one of your tools is "send an email" or "charge the card," you want the agent to know what it's going to do before it does it, so a human can approve.
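For the irreversible-step case, one lightweight option short of full plan-and-execute is a gate around the tool call itself. This is a sketch under assumptions: the tool names in `IRREVERSIBLE` are illustrative, and `approve` is a hypothetical callback standing in for whatever review step you have (a CLI prompt, a Slack button, a ticket).

```python
from typing import Callable

# Assumption: illustrative names; list your own tools that can't be undone.
IRREVERSIBLE = {"send_email", "charge_card"}

def guarded_call(
    name: str,
    tool_input: dict,
    tools: dict[str, Callable[[dict], str]],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run a tool, but ask `approve` first when the tool can't be undone."""
    if name in IRREVERSIBLE and not approve(name, tool_input):
        # Return the refusal as a tool result so the model can react to it.
        return f"DENIED: a human declined {name}"
    return tools[name](tool_input)

# Usage with stub tools and a reviewer who declines everything.
tools = {
    "send_email": lambda i: f"sent to {i['to']}",
    "get_time": lambda i: "09:00",
}
deny_all = lambda name, tool_input: False
print(guarded_call("get_time", {}, tools, deny_all))
print(guarded_call("send_email", {"to": "pm@example.com"}, tools, deny_all))
```

Read-only tools skip the gate, so the reviewer only sees the calls that matter.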

For everything else — most customer support bots, most research assistants, most code agents — reactive is the right default, and the thing to reach for is better tool design, better system prompts, and better evals, not a more elaborate agent pattern.

Plan-and-execute: when the plan is worth writing down

Plan-and-execute shines when the task is big enough that the plan itself is useful. Here's the shape:

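# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Phase 1: write the plan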
PLANNER_SYSTEM = (
    "You are a planner. Given the user's request, output a numbered list "
    "of steps. Each step should be concrete and use one of the available "
    "tools. Do not execute any steps — just write the plan."
)

def write_plan(user_message: str, tools_doc: str) -> list[str]:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        system=f"{PLANNER_SYSTEM}\n\nAvailable tools:\n{tools_doc}",
        messages=[{"role": "user", "content": user_message}],
    )
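    text = resp.content[0].text
    # Parse the numbered list into steps. Drop each line's own "1." prefix so
    # execute_plan can renumber without producing "1. 1. ...".
    steps = []
    for line in text.splitlines():
        number, sep, rest = line.strip().partition(".")
        if sep and number.isdigit() and rest.strip():
            steps.append(rest.strip())
    return steps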

# Phase 2: execute the plan, step by step, with tool use
def execute_plan(plan: list[str], user_message: str) -> str:
    # Run the reactive loop with the plan pinned in the first user message.
    messages = [{"role": "user", "content": (
        f"Original request: {user_message}\n\n"
        f"Plan to execute:\n" + "\n".join(f"{i+1}. {s}" for i, s in enumerate(plan))
    )}]
    return run_agent_loop(messages)  # reuse the loop from B4.2

The planning step adds ~500–1000 tokens of cost per user request. In exchange, you get:

The plan itself as an artifact — logged, auditable, maybe shown to the user for approval.
Better adherence to multi-step tasks where the reactive loop would drift.
A single point to inject human-in-the-loop — pause after planning, let a human approve, then execute.

Plan-and-execute makes the task slower and more expensive. Every task pays the planning cost even if it's a simple one. For a five-tool agent handling 10,000 user requests a day, that's real money.
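To put a number on "real money", here is the arithmetic as a small sketch. The request volume and the 500 to 1000 token range come from the text above; the price per million tokens is a placeholder I made up for illustration, not a quote for any model, so substitute your own rate.

```python
def daily_planning_overhead(
    requests_per_day: int,
    plan_tokens: int,
    usd_per_million_tokens: float,
) -> tuple[int, float]:
    """Extra tokens and dollars per day spent on the planning call alone."""
    extra_tokens = requests_per_day * plan_tokens
    return extra_tokens, extra_tokens / 1_000_000 * usd_per_million_tokens

PLACEHOLDER_PRICE = 10.0  # assumption: USD per million tokens, illustrative only

for plan_tokens in (500, 1000):
    tokens, usd = daily_planning_overhead(10_000, plan_tokens, PLACEHOLDER_PRICE)
    print(f"{plan_tokens} tokens/plan -> {tokens:,} extra tokens/day, ${usd:.2f}/day")
# 500 tokens/plan -> 5,000,000 extra tokens/day, $50.00/day
# 1000 tokens/plan -> 10,000,000 extra tokens/day, $100.00/day
```

This counts only the planning call. The plan also rides along in the context of every execution step, which this sketch ignores.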

Use plan-and-execute when:

Tasks are genuinely long (5+ steps on average).
You need to show the plan to a human for approval before acting.
Your tools have irreversible side effects and you want a visible plan before anything runs.
You're auditing against a compliance requirement that mandates explicit reasoning traces.

Skip it when:

Tasks are usually 1–3 steps.
The model handles sequencing fine on its own (measure with your eval set).
You're latency-sensitive and every extra round trip is visible to the user.
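"Measure with your eval set" can be as small as the sketch below: run the reactive agent and the plan-and-execute agent over the same cases and compare pass rates. The function names and the case format (a prompt paired with a checker function) are my assumptions here, not the B2.5 harness, and the two agents are stubs so the example runs offline.

```python
from typing import Callable

Case = tuple[str, Callable[[str], bool]]  # (prompt, checker for the answer)

def pass_rate(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of eval cases whose checker accepts the agent's answer."""
    passed = sum(1 for prompt, check in cases if check(agent(prompt)))
    return passed / len(cases)

def compare(baseline: Callable[[str], str], candidate: Callable[[str], str],
            cases: list[Case]) -> dict[str, float]:
    base, cand = pass_rate(baseline, cases), pass_rate(candidate, cases)
    return {"baseline": base, "candidate": cand, "lift": cand - base}

# Stub agents and a two-case eval set, purely for illustration.
cases = [
    ("2+2", lambda answer: answer == "4"),
    ("capital of France", lambda answer: "Paris" in answer),
]
reactive_stub = lambda prompt: "4"
planner_stub = lambda prompt: "4" if prompt == "2+2" else "Paris"
print(compare(reactive_stub, planner_stub, cases))
```

If the lift is flat on your real cases, the planning call is pure overhead.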

Reflection: the diminishing-returns pattern

Reflection is the most over-applied pattern in the agent literature. The original paper showed big gains on specific benchmarks. Real-world applications have been much quieter.

The typical reflection pattern:

def run_with_reflection(user_message: str) -> str:
    # First attempt
    first_answer = run_agent_loop([{"role": "user", "content": user_message}])

    # Critique
    critique = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system="You are a strict critic. Find errors, missing details, or weak reasoning.",
        messages=[{"role": "user", "content": (
            f"User question: {user_message}\n\n"
            f"Answer attempt: {first_answer}\n\n"
            f"What is wrong with this answer?"
        )}],
    ).content[0].text

    # Decide whether to retry
    if "answer is correct" in critique.lower() or "no issues" in critique.lower():
        return first_answer

    # Retry with the critique as extra context
    return run_agent_loop([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": f"A reviewer found these issues: {critique}\n\nPlease revise."},
    ])

Here's the honest picture from production:

Reflection helps on tasks where "better output" is obvious after seeing the first attempt. Coding: the critic sees the code, runs the tests, finds a bug, tells the agent. Math: the critic checks the arithmetic. Writing: the critic spots the typo. For these tasks, reflection can earn its tokens.
Reflection does not help on tasks where the critic can't really tell. If the first answer is a subtle hallucination about a fact the critic also doesn't know, the critic will mostly rubber-stamp the hallucination. On fact-heavy tasks, reflection is confidence-multiplication rather than quality-multiplication.
Reflection doubles your cost. At minimum one extra model call, often two (critique + retry). For 90% of tasks, the second call doesn't change the answer.
Reflection prompts are surprisingly load-bearing. A critic prompt that says "find issues" will find issues even when there are none, creating phantom retries. A critic that says "confirm this is correct" will usually confirm. Calibrating the critic's harshness is its own mini-project.
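One way to make the retry decision less dependent on the critic's phrasing is to ask the critic for a small JSON verdict and parse that instead of matching substrings. This is a sketch, assuming you have prompted the critic to reply in this shape; the `verdict` and `issues` field names are hypothetical.

```python
import json

def parse_verdict(critique_text: str) -> tuple[bool, list[str]]:
    """Return (should_retry, issues) from a critic reply.

    Expected shape (an assumption): {"verdict": "pass" | "revise", "issues": [...]}
    Replies that don't parse never trigger a retry.
    """
    try:
        data = json.loads(critique_text)
    except json.JSONDecodeError:
        return False, []
    if not isinstance(data, dict):
        return False, []
    raw_issues = data.get("issues")
    issues = [str(item) for item in raw_issues] if isinstance(raw_issues, list) else []
    # Require at least one concrete issue, to dampen phantom retries.
    should_retry = data.get("verdict") == "revise" and bool(issues)
    return should_retry, issues
```

Treating an unparseable critique as "no retry" is a design choice that favours cost. If a missed error is worse than a wasted call in your setting, flip that default.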

Use reflection when:

You have a ground-truth check available (tests pass, output validates against a schema, arithmetic can be verified). This is the sweet spot.
The task has clear objective quality criteria that the critic can enforce (a coding agent can check "does the code compile?").
You can afford the extra tokens and latency, and you've measured the quality lift on your eval set.

Skip reflection when:

The critic can't actually verify quality (fact-heavy Q&A, creative writing).
Your eval set shows flat or marginal improvement with reflection on.
You're latency-critical.

A useful middle ground: reflection only when the first answer triggers a specific heuristic. If the output fails schema validation, if a test fails, if a tool call returned an error the agent didn't handle well — then run reflection. Skip it by default.
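Here is roughly what that gating can look like for the schema-validation trigger. It is a sketch under assumptions: the output is expected to be a JSON object, `REQUIRED_KEYS` is a made-up schema, and `run_once` and `revise` are hypothetical callables standing in for the first agent run and the critique-plus-retry path.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"answer", "sources"}  # assumption: illustrative output schema

def reflection_trigger(raw_output: str) -> Optional[str]:
    """Return a reason to reflect, or None if the output passes the cheap checks."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return f"output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "output is not a JSON object"
    missing = REQUIRED_KEYS - set(data)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None

def answer(
    user_message: str,
    run_once: Callable[[str], str],
    revise: Callable[[str, str, str], str],
) -> str:
    first = run_once(user_message)
    reason = reflection_trigger(first)
    if reason is None:
        return first  # common path: no extra model call
    # Only now pay for reflection, and hand it the concrete failure.
    return revise(user_message, first, reason)
```

The trigger is plain code, so the common path costs nothing extra, and the retry gets a specific failure to fix instead of an open-ended critique.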

The pattern nobody calls by name but everyone uses

There's a fourth pattern that doesn't have a good name but works everywhere: structured output as the plan. Instead of a free-text plan or a critique, you have the model emit a structured object at key points in the agent's flow.

For example, at the start: "What's the type of this request? (info_query, action_request, small_talk, out_of_scope)." If it's small_talk, skip tools entirely. If it's out_of_scope, return a canned response. If it's action_request, route to the reactive loop. If it's info_query, route to the RAG-and-answer path.

# pip install anthropic pydantic
from pydantic import BaseModel
from typing import Literal

class RequestType(BaseModel):
    type: Literal["info_query", "action_request", "small_talk", "out_of_scope"]
    reasoning: str

def classify(user_message: str) -> RequestType:
    # structured output, from B1.3
    ...

def handle(user_message: str) -> str:
    request = classify(user_message)
    if request.type == "small_talk":
        return handle_small_talk(user_message)
    if request.type == "out_of_scope":
        return "I can't help with that, but here's what I can do: ..."
    if request.type == "info_query":
        return run_rag(user_message)
    if request.type == "action_request":
        # run_agent_loop takes a message list, as in the earlier examples
        return run_agent_loop([{"role": "user", "content": user_message}])

This is just code with a structured classifier at the top, not an "agent pattern" in the framework sense. But it's the single biggest win for production agent quality I've seen: front-load a classifier to route the user's request to the right handler, and handle each handler the simplest possible way. Small talk doesn't need a 10-step agent. Out-of-scope questions don't need to invoke any tools. Info queries might not need any tool calls at all.

Call this pattern "structured routing" if you need a name for it. Use it liberally. It beats every other pattern in this post on cost and latency and matches most of them on quality.
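The routing step is only as safe as its handling of bad classifier output. The sketch below covers just that part: it takes whatever raw text the classifier produced and maps anything malformed or unknown to an "uncertain" route that runs the full agent loop. It is not the B1.3 structured-output code; `parse_route`, `dispatch`, and the stub handlers are hypothetical names, and it uses only the standard library so it runs on its own.

```python
import json
from typing import Callable

ROUTES = ("info_query", "action_request", "small_talk", "out_of_scope")

def parse_route(classifier_output: str) -> str:
    """Map raw classifier output to a route; anything unexpected is 'uncertain'."""
    try:
        data = json.loads(classifier_output)
    except json.JSONDecodeError:
        return "uncertain"
    route = data.get("type") if isinstance(data, dict) else None
    return route if route in ROUTES else "uncertain"

def dispatch(
    user_message: str,
    classifier_output: str,
    handlers: dict[str, Callable[[str], str]],
) -> str:
    route = parse_route(classifier_output)
    # 'uncertain' falls through to the full agent loop instead of a rejection.
    handler = handlers.get(route, handlers["action_request"])
    return handler(user_message)

# Stub handlers for illustration.
handlers = {
    "small_talk": lambda m: "Hi there!",
    "out_of_scope": lambda m: "I can't help with that.",
    "info_query": lambda m: f"[rag] {m}",
    "action_request": lambda m: f"[agent loop] {m}",
}
print(dispatch("hello", '{"type": "small_talk", "reasoning": "greeting"}', handlers))
print(dispatch("refund me", "garbled output", handlers))
```

Defaulting to the agent loop costs more than a canned reply, but it means a classifier hiccup degrades to "slower" and not to "wrongly rejected".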

A practical decision tree

For most new agent projects, the right starting point is:

1 · Start reactive. Use the loop from B4.2. No planning. No reflection.
2 · Add structured routing at the top to avoid running the full agent for off-task inputs.
3 · Measure with evals (per B2.5). Find the cases the reactive loop gets wrong.
4 · For the failure cases, diagnose: is it a tool-design issue? A system-prompt issue? A chunking issue (if RAG)? A budget issue?
5 · Only if the model genuinely can't sequence long tasks well, add plan-and-execute. Only if you have a verifiable quality signal, add reflection.

Most projects never need to reach step 5. The gains from better tool design, better prompts, better retrieval, and better evals exceed the gains from fancier agent patterns, by a large margin, on almost every task I've measured.
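If it helps to see the decision tree as code, here is one possible encoding of the criteria from this post. Treat it as a sketch of my own reading, not a rule: `TaskProfile`, its fields, and the `min_lift` threshold are all hypothetical, and how to resolve conflicts (for example, latency-critical but irreversible tools) is an assumption I've made explicit in the comments.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    avg_steps: float              # measured from your traces
    needs_human_approval: bool    # a plan must be shown before acting
    has_irreversible_tools: bool  # e.g. sending email, charging a card
    has_ground_truth_check: bool  # tests, schema validation, checkable arithmetic
    reflection_lift: float        # pass-rate delta with reflection on, from evals
    latency_critical: bool

def choose_patterns(p: TaskProfile, min_lift: float = 0.02) -> list[str]:
    """Pick patterns per the post's criteria; min_lift is an arbitrary placeholder."""
    patterns = ["reactive", "structured_routing"]  # the default starting point
    # Assumption: approval and irreversibility outrank latency concerns.
    if p.needs_human_approval or p.has_irreversible_tools or (
        p.avg_steps >= 5 and not p.latency_critical
    ):
        patterns.append("plan_and_execute")
    if p.has_ground_truth_check and p.reflection_lift >= min_lift and not p.latency_critical:
        patterns.append("reflection")
    return patterns

# A short support-bot style task: the defaults are enough.
print(choose_patterns(TaskProfile(2, False, False, False, 0.0, False)))
# ['reactive', 'structured_routing']
```

Note that two of the inputs, `avg_steps` and `reflection_lift`, can only come from measurement, which is the point of step 3.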

Admit what breaks

Reactive loops can drift on long tasks. 15 iterations in, the model has forgotten what it was originally doing. Fix: periodically remind the model of the original question in the system prompt, or use plan-and-execute with the plan pinned.
Plan-and-execute can't handle surprises. The plan was written assuming step 1 would succeed. Step 1 failed. The plan is now invalid. Either throw the plan out and revert to reactive, or rewrite the plan mid-flight — both are additional complexity. Design accordingly.
Reflection can amplify confident wrong answers. The critic trusts the agent's confidence, the agent trusts the critic's sign-off, everyone agrees the wrong answer is right. Break the cycle by having the critic independently verify against a ground truth when possible.
Structured routing classifiers misclassify. A benign user question gets routed to "out_of_scope" and the user gets a rejection. Fix: add an uncertain branch that falls through to the full agent loop, and tune the classifier on real queries.
All four patterns share one failure mode: tool quality. A great agent pattern on broken tools produces confident nonsense. Spend your first week on tools, not patterns.
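For the drift fix in the first item, a periodic reminder can be a few lines. This is a sketch: `system_with_reminder` is a hypothetical helper, the every-5-iterations cadence is an arbitrary assumption, and it presumes your loop rebuilds the system prompt on each iteration.

```python
def system_with_reminder(
    base_system: str,
    original_request: str,
    iteration: int,
    every: int = 5,
) -> str:
    """Return the system prompt, restating the original request every `every` iterations."""
    if iteration > 0 and iteration % every == 0:
        return (
            f"{base_system}\n\n"
            f"Reminder: the user's original request was:\n{original_request}"
        )
    return base_system

# Iterations 5, 10, 15, ... carry the reminder; the rest use the base prompt.
for i in (0, 4, 5):
    print(i, "Reminder" in system_with_reminder("You are a helpful agent.", "Compare 10 competitors", i))
# 0 False
# 4 False
# 5 True
```

Putting the reminder in the system prompt, as the text suggests, avoids inserting an extra user turn between a tool call and its result.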

What just changed in your code

Default to reactive. It's cheaper, simpler, and matches or beats fancier patterns on most tasks.
Add structured routing at the top of every agent to shortcut off-task inputs. This is the single highest-leverage move in this post.
Measure, then decide. Don't adopt plan-and-execute or reflection because a paper liked it. Adopt each one when you have a specific failure mode they address and an eval that shows them helping.
Reflection is best when you have a ground-truth check (tests, schema, arithmetic). Use it there; skip it elsewhere.
If plan-and-execute helps, pin the plan at the top of the system prompt during execution so the model doesn't drift.
Spend more time on tools and prompts than on patterns. Every hour there pays more than an hour on agent architecture.

Next post, B4.4, is the hot one: multi-agent systems. The "team of specialised agents that coordinate" dream. We'll look at when it actually helps (rarely) and when it's just multiplying your bugs (almost always).


📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.
