Multi-Agent Systems: When More Agents Help

Walk into any AI conference talk in 2026 and you will hear the phrase "team of specialised agents." A PM agent, a researcher agent, a coder agent, a reviewer agent. They coordinate. They delegate. They review each other's work. Diagrams show them as distinct boxes with arrows flowing between them. The pitch is that specialisation beats generalisation: each agent is focused, each prompt is clean, each role is clear.

The pitch is seductive and mostly wrong.

Most multi-agent systems I've seen in production are just badly abstracted single-agent systems, with more surface area, more latency, more bugs, and the same quality you'd get from one well-written loop. The "specialised agents" are each calling the same underlying model with slightly different system prompts, the coordination between them is a fragile message-passing layer the team wrote themselves, and the debugging story is a nightmare because when something goes wrong nobody knows which of the five agents produced it.

This post is the field guide I wish existed before I built my first multi-agent system. I will walk you through when multi-agent helps, the three patterns that actually work, the three that don't, and the test you should run before committing to any multi-agent architecture.

No research paper citations. No framework advocacy. Just what I've seen earn its complexity and what I've seen be a net loss.

The case for single-agent by default

Start here, because this is the default you should move away from only with evidence.

A well-tuned single-agent system — a reactive loop with good tools, a clear system prompt, a structured router at the top (see B4.3), and decent evals — will outperform most multi-agent systems because:

Shared context. One agent has the entire conversation and every tool result in its context. It doesn't have to lossy-summarise information for another agent to consume. No context-passing bugs.
One set of prompts to tune. Multi-agent systems multiply your prompt-tuning surface area. Every agent has its own system prompt, its own few-shot examples, its own failure modes.
One place to debug. When something goes wrong, you look at one loop and one message history. Not at a message bus between five agents with different personalities.
One place to enforce budgets and safety. A single agent has one max-iterations cap. A multi-agent system has a budget per agent plus a total system budget plus cross-agent communication overhead, and any of them can blow up.
Modern frontier models are already general. The pitch for specialised agents is that "you can tune each one for its role." In practice, the models are general-purpose enough that role-specific tuning gains 5% where switching from a reactive loop to a multi-agent setup adds 3x complexity and 3x failure modes.

Before reaching for multi-agent, answer this honestly: is my single-agent baseline actually tuned? Do I have evals? Do I know what it gets wrong? Have I tried better tools, better prompts, better retrieval? If the answer is no, multi-agent will not save you — it will just give you a more complicated baseline to not tune.

The three multi-agent patterns that actually work

With that caveat front and centre, here are three shapes where multi-agent genuinely earns its complexity.

1 · Orchestrator with sub-agents for isolation

The orchestrator pattern: one main agent receives the user's request and delegates sub-tasks to other agents, each running in an isolated context. The sub-agents never see the main conversation — they see only the task the orchestrator hands them.

When this wins:

Context hygiene matters. Your main agent is having a sensitive conversation and you want to do a web search as part of answering, but you don't want the web content to have any chance of injecting instructions into the main conversation. A sub-agent with its own context, tools, and constrained output schema is a safe way to fetch and distil.
Different roles need different tool sets. Your main agent handles the user conversation and has a restricted tool set (no delete_user). A sub-agent has elevated permissions for specific safe operations, invoked only when the main agent explicitly delegates to it, with an audit trail.

def orchestrator(user_message: str) -> str:
    # Main loop with limited tools.
    messages = [{"role": "user", "content": user_message}]

    while True:
        resp = call_model(messages, tools=SAFE_TOOLS + [DELEGATE_TOOL])
        if resp.stop_reason == "end_turn":
            return resp.text
        for tool_call in resp.tool_calls:
            if tool_call.name == "delegate_to_research_agent":
                # Sub-agent in isolated context.
                result = run_research_agent(tool_call.args["task"])
                messages.append_tool_result(tool_call.id, result)
            else:
                result = run_local_tool(tool_call)
                messages.append_tool_result(tool_call.id, result)

The sub-agent is just a function. The "multi-agent" part is the isolation boundary, not the existence of a second prompt.

2 · Parallel fan-out for independent work

Some tasks are genuinely embarrassingly parallel. "Research five competitors and summarise each one" is five independent research tasks, and running them sequentially is wasted latency. A main agent can fan out to five sub-agent instances running in parallel, each with its own task and context, then merge the results.

async def research_competitors(competitors: list[str]) -> list[dict]:
    async def research_one(name: str) -> dict:
        return await run_research_agent(f"Research company: {name}")

    return await asyncio.gather(*(research_one(c) for c in competitors))

This is a valid multi-agent pattern because the sub-agents are genuinely independent (no coordination needed between them) and the parallelism is a concrete speedup. Notice what it is not: it's not "five specialised agents with different personalities arguing about the answer." It's one agent kind, five instances, run concurrently.

3 · Critic-worker pair for verifiable quality checks

A narrow case of reflection from B4.3, but structured as two agents with strict role separation:

Worker agent produces the answer.
Critic agent independently verifies the answer against a check (tests pass, schema validates, math is consistent).

This earns its complexity when the critic can actually verify, not just opine. A code worker + test-runner critic is a real win; the critic's verdict is grounded in test results. A prose worker + vague critic is not a win; the critic is just the same kind of model producing more confident prose.

def worker_critic(user_message: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        answer = run_worker_agent(user_message)
        verdict = run_critic_agent(user_message, answer)
        if verdict.passed:
            return answer
        user_message = f"{user_message}\n\nPrevious attempt issues: {verdict.issues}"
    return answer  # final attempt

Note the key detail: the critic has to have a verifiable signal. Running tests. Validating a schema. Checking arithmetic. Looking up a source in a database. Without a verifiable signal, the critic is just another opinionated LLM, and its opinions don't make the answer more true.

The three patterns that usually don't work

1 · "Team of personalities"

The most popular and least useful pattern. Five agents with different personalities — a PM agent, an engineer agent, a designer agent, a sceptic agent, a manager agent — debating and refining the answer. This pattern looks intelligent. It produces lots of conversation. It consumes a lot of tokens.

In practice:

The agents mostly agree with each other because they're the same underlying model.
Disagreements are resolved by whichever agent goes last or speaks longest.
The quality of the final answer is usually the same or slightly worse than a single-agent answer with a good prompt that says "consider the PM perspective, the engineering perspective, and the design perspective."
Debugging a wrong answer means untangling five personalities' contributions.

The "team of personalities" pattern is a way to feel like you're doing serious AI work. It is not usually a way to produce better results. If you need multiple perspectives, a single-agent prompt that asks for multiple perspectives is cheaper, faster, and clearer.

2 · "Specialised by domain"

"A billing agent, a technical agent, a shipping agent" routing by domain. This sounds organised but usually just replicates the structured routing pattern from B4.3 with extra steps.

You do not need separate agents to handle billing vs technical questions. You need:

A classifier at the top (one structured-output call) to route the request.
Each domain gets its own system prompt, its own tool set, its own RAG scope.
The "agent" for each domain is the same reactive loop with different configuration.

Calling each configuration a "specialised agent" is naming, not architecture. The architecture is "one loop, configurable per route." Much simpler to build, debug, and improve.

3 · "Long chain of agents"

Agent 1 produces output A. Agent 2 reads A and produces output B. Agent 3 reads B and produces output C. By agent 4, the original user intent has been lossy-compressed through three handoffs, each of which had a chance to misread, summarise wrong, or drop a constraint. By agent 5, the answer has drifted from the question.

This is the pattern where "you're just multiplying bugs" is most literally true. Each handoff is a compression step. Each compression step has a chance of error. With five agents and a 95% per-step fidelity, you're at 77% end-to-end fidelity. At 90% per-step, you're at 59%. The math is brutal.

If you find yourself chaining four or more agents, stop and ask: can I do this with one agent and more context? The answer is almost always yes. The rare exception is when the chain is actually a pipeline of distinct transformations (parse → retrieve → compose → format), in which case it's not really "agents" — it's a pipeline with LLM nodes, and you should architect it as a pipeline with logging at each stage, not as a group of agents passing messages.

The test before you commit

Here is the test I run before committing to any multi-agent architecture. Write down the answer to each question:

What is the single-agent baseline quality on my eval set? Don't skip this. Measure it.
What specific failure modes does the multi-agent design address? Not "it's more modular." Specific failures: "the single agent forgets the research context when switching to code mode."
What does the multi-agent design add in latency and cost? Do the math. Roughly, each additional agent handoff adds 2–5 seconds and 2x the token cost.
What does the debug story look like when something goes wrong? Walk through a hypothetical bug end-to-end. If the answer is "log every message between agents and read them by hand," that's a mess, not a feature.
Could I achieve the same thing with one agent and better prompting? Nine times out of ten, yes.

If after this exercise you still want multi-agent, go ahead — you've earned it. If you're hesitating, your gut is right.

Admit what breaks

Message bus complexity. Getting agents to talk to each other requires a message format, serialisation, error handling, and timeouts. You are writing distributed systems. Distributed systems are hard.
Token budget drift. Each agent adds its own system prompt, tool definitions, and context window consumption. A five-agent system can consume 5x the tokens of a single-agent system for comparable tasks. Budget accordingly.
Debugging requires cross-agent tracing. You need a trace ID that follows a request through every agent's message history. LangSmith, OpenTelemetry, or roll-your-own. Without tracing, multi-agent is effectively undebugable.
Consistency is hard. Agent 1 says "the user is a premium customer." Agent 3 didn't get that context and assumes free-tier. You now have contradictory behaviour. Every cross-agent piece of state is a consistency problem waiting to happen.
Cascading failures. If agent 2 fails, agents 3–5 never run, or run on bad input. Error handling multiplies.
Most teams underestimate single-agent quality. They give up on single-agent before really tuning it, then spend weeks on a multi-agent system that ends up at the same quality bar.

What just changed in your code

Start single-agent. Stay single-agent as long as possible. The default is the default for a reason.
Use structured routing (per B4.3) before "specialised agents." Routing is usually what you actually want.
Before reaching for multi-agent, tune your single agent hard. Better tools, better prompts, better evals. 9 out of 10 quality problems live there.
If you do go multi-agent, only use the three patterns that work: orchestrator-with-isolated-subagents, parallel fan-out, critic-worker with a verifiable signal.
Never chain more than 3 agents. Each handoff is a compression step.
Invest in tracing before you invest in multi-agent. You will need it; build it first.
Ask: could I do this with one agent and better context? If yes, do that.

Next post, B4.5, we close out Module B4 with the unglamorous practical question every agent team eventually hits: long-running agents. State, resumption, idempotency, and why you need a database before you need a graph framework. The plumbing that turns a demo into a product that survives process restarts.

⬅️ Previous	📍 You are here	Next ➡️
⬅️ Previous B4.3 · Planning vs Reacting	B4.4 of B6.4	Next ➡️ B4.5 · Long-Running Agents

📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.

Multi-Agent Systems: When More Agents Help, When They Multiply Bugs

The case for single-agent by default

The three multi-agent patterns that actually work

1 · Orchestrator with sub-agents for isolation

2 · Parallel fan-out for independent work

3 · Critic-worker pair for verifiable quality checks

The three patterns that usually don't work

1 · "Team of personalities"

2 · "Specialised by domain"

3 · "Long chain of agents"

The test before you commit

Admit what breaks

What just changed in your code

Course navigation

Comments

AI for Builders

Long-Running Agents: State, Resumption, and the Database You Need First

More from this blog

A Reading List and Two Habits: Staying Current in Ten Minutes a Week

What to Decide Now, What to Defer, What to Ignore: The AI Action Matrix

The Next 18 Months of AI: A Calibrated Leader's Forecast

Calibrating Your AI Exposure: Upside and Downside in One Matrix

Five AI Capabilities That Matter for Your Business, and Five That Do Not

Command Palette

The case for single-agent by default

The three multi-agent patterns that actually work

1 · Orchestrator with sub-agents for isolation

2 · Parallel fan-out for independent work

3 · Critic-worker pair for verifiable quality checks

The three patterns that usually don't work

1 · "Team of personalities"

2 · "Specialised by domain"

3 · "Long chain of agents"

The test before you commit

Admit what breaks

What just changed in your code

Course navigation

Comments

AI for Builders

Long-Running Agents: State, Resumption, and the Database You Need First

More from this blog