
The Multi-Agent Default Is Don't

The first thing most PMs want to do with agents is build a team of them. The research mostly says don't. Here's when that's right, and when it isn't.

The first thing most PMs want to do with agents is build a team of them. A planner that decides the work. A researcher that gathers the context. A writer that produces the output. A critic that reviews it. The instinct comes from how human teams work. The pattern fits the way I think about org charts. It feels right.

I’m guilty of this too. The first thing I wanted when I started building agents was a roster of them. The research says I was wrong.

Cognition, the team behind Devin, published a post titled “Don’t Build Multi-Agents.” Their summary: when they replaced their single-agent architecture with a multi-agent one, the system got worse. The subagents misunderstood each other. They produced stylistically incompatible output. The coordinator couldn’t reconcile assumptions the agents made in isolation. They went back to a single agent with read-only sub-components and the quality improved.

Anthropic’s stance is similar. Their “Building Effective Agents” guidance advises: “Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.” They’ve also measured the cost: in their multi-agent research system, multi-agent runs used about fifteen times more tokens than chat interactions, against about four times more for a single agent. OpenAI’s practical guide repeats the same advice: maximize a single agent’s capabilities first, and add more agents only when separation actually solves a specific problem.

The hardest data comes from two studies. Google DeepMind ran 180 controlled experiments comparing single-agent and multi-agent architectures, and a separate UC Berkeley study analyzed 1,600 production traces from multi-agent systems in the wild.

  17.2x: error amplification in unstructured multi-agent (DeepMind, 180 experiments)

  4.4x: error rate even with centralized orchestration (DeepMind)

  39-70%: degradation on sequential reasoning tasks (DeepMind)

  41-86.7%: of multi-agent systems fail in production (UC Berkeley, 1,600 traces)

Where multi-agent does work

The research isn’t saying never use multiple agents. It’s saying the default is one, and the exceptions are specific.

Parallel read-only exploration

When agents pursue independent paths and don’t modify shared state, multi-agent improves on single-agent. Anthropic’s multi-agent research system measured 90% improvement on research tasks. DeepMind measured 80.9% improvement on financial reasoning when agents analyzed revenue, costs, and market comparisons in parallel.
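A minimal sketch of that pattern, assuming a hypothetical `explore` coroutine standing in for a real model call. The names, the topics, and the shared-context shape are all illustrative, not taken from Anthropic’s or DeepMind’s systems. The point is only the shape: independent branches fan out in parallel, and each one gets a read-only view of the shared context.

```python
import asyncio
from types import MappingProxyType


async def explore(topic: str, context) -> dict:
    # Hypothetical stand-in for a subagent; a real one would call a model here.
    await asyncio.sleep(0)
    return {"topic": topic, "finding": f"summary of {topic} for {context['company']}"}


async def fan_out(topics, context: dict) -> list:
    # MappingProxyType is a read-only view of the top-level dict:
    # subagents can read shared context, but assigning a key raises TypeError.
    frozen = MappingProxyType(context)
    # gather() runs the branches concurrently and returns results in input order.
    return await asyncio.gather(*(explore(t, frozen) for t in topics))


def run_parallel_exploration(topics, context: dict) -> list:
    return asyncio.run(fan_out(topics, context))
```

Calling `run_parallel_exploration(["revenue", "costs", "market"], {"company": "Acme"})` returns one finding per topic. Nothing a branch does can change what another branch sees, which is the property that makes this case safe.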

Functional separation

Anthropic’s harness design guide splits long-running work into Planner / Generator / Evaluator. Plan-and-Act achieved 53.9% on WebArena-Lite. ADaPT improved ALFWorld by 28.3pp over static plan-and-execute. This works because each agent’s job is genuinely different and the handoff product is well-defined.

High-throughput concurrency

Mount Sinai’s healthcare study found multi-agent held 90.6% accuracy at 5 concurrent tasks vs single-agent dropping from 73.1% to 16.6% at the same load. When the bottleneck is “one task at a time,” multi-agent wins.

That’s three cases. Parallel exploration, functional separation, high-throughput concurrency. Each one has clear measurable wins. Each one fits a specific pattern.

Where it breaks

The common thread of failed multi-agent systems is shared context that the agents can’t actually share. The Planner makes a decision the Generator doesn’t see. The Researcher gathers context the Writer doesn’t get. The Critic reviews output the Writer didn’t intend that way. Every handoff is an information loss boundary, and on sequential work where decisions depend on each other, the boundaries compound.
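To make “the boundaries compound” concrete, here is a toy calculation. It assumes each handoff independently preserves a fixed fraction of the relevant context. That assumption, and the 0.9 figure used below, are mine for illustration and are not measurements from any of the studies above.

```python
def context_retained(per_handoff_fidelity: float, handoffs: int) -> float:
    """Fraction of the original context that survives a chain of sequential handoffs.

    Toy model: every handoff keeps the same fraction, and losses multiply.
    """
    if not 0.0 <= per_handoff_fidelity <= 1.0:
        raise ValueError("per_handoff_fidelity must be between 0 and 1")
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    return per_handoff_fidelity ** handoffs


if __name__ == "__main__":
    # Illustrative number only: 0.9 is an assumed fidelity, not a measured one.
    for n in range(4):
        print(n, round(context_retained(0.9, n), 3))
```

A Planner, Researcher, Writer, Critic chain has three handoffs, so under this assumption roughly 73% of the original context reaches the last agent. A single agent has zero handoffs and keeps all of it.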

The “Spark to Fire” paper modeled multi-agent collaboration as a directed dependency graph and found that injecting a single atomic error early in the graph leads to system-level false consensus. Minor inaccuracies don’t get corrected, they get reinforced. Each agent treats the output of the previous one as ground truth.
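A rough way to see the blast radius of one early error is to walk the dependency graph from the agent that made it. This is not the paper’s model, just a plain reachability check over a hypothetical four-agent pipeline, assuming every agent trusts its upstream inputs completely.

```python
from collections import deque


def tainted_agents(edges: dict[str, list[str]], source: str) -> set[str]:
    """Return every agent that consumes the erroneous output, directly or transitively."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for downstream in edges.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical pipeline: each key hands its output to the agents in its list.
pipeline = {
    "planner": ["researcher", "writer"],
    "researcher": ["writer"],
    "writer": ["critic"],
    "critic": [],
}
```

Here `tainted_agents(pipeline, "planner")` covers all four agents, while `tainted_agents(pipeline, "writer")` covers only the writer and the critic. The earlier the error, the more of the system ends up agreeing with it.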

This is the failure mode I keep watching for when I’m tempted to add an agent. The answer to “Would this work if I just gave the single agent better instructions?” is almost always yes, with much less downside.

The check I run

Before I add a second agent, I ask:

  1. Is the work I want to split actually independent, or does it depend on context that lives in the parent’s conversation?

  2. Can I define what “good output” looks like at the handoff boundary, precisely enough that the parent can tell whether the subagent did its job?

  3. Is the single-agent version failing for a specific reason, or am I splitting because the human-team analogy feels right?

If the answers are independent / yes / specific reason, I split. If they’re “kind of dependent” / “not really” / “feels right,” I don’t.
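The check is simple enough to write down as code. This is a sketch of the rule as I’ve described it, with field names I made up for the purpose. All three answers have to come back clean before the split is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCheck:
    # 1. The work does not depend on context living in the parent's conversation.
    work_is_independent: bool
    # 2. "Good output" at the handoff is defined precisely enough for the parent to verify.
    handoff_output_is_checkable: bool
    # 3. The single-agent version fails for a specific, identified reason.
    single_agent_fails_for_specific_reason: bool


def should_split(check: SplitCheck) -> bool:
    """Split only when every answer is a clear yes; anything weaker means stay single-agent."""
    return (
        check.work_is_independent
        and check.handoff_output_is_checkable
        and check.single_agent_fails_for_specific_reason
    )
```

A “kind of dependent” answer gets recorded as `False`. There is deliberately no partial credit: one soft answer is enough to keep the default.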

I have one multi-agent system in production. It’s my Autobots harness, and it’s a functional separation for long-running work. It took eight months to get the architecture right. The early versions failed in exactly the ways the research predicted. The version that works is the one where the agents’ jobs are genuinely separate, the handoff product is a well-defined file, and the orchestrating agent has read access to everything but write authority over none of it.
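What “a well-defined file” can mean in practice: the orchestrator reads the handoff and rejects it if it doesn’t match the agreed shape. The sketch below is not the Autobots code. It assumes a JSON handoff with three hypothetical required keys, and it shows the orchestrator side only, which opens the file for reading and never writes.

```python
import json
from pathlib import Path

# Hypothetical schema: the keys a subagent must include in its handoff file.
REQUIRED_KEYS = {"task_id", "status", "summary"}


def parse_handoff(text: str) -> dict:
    """Validate the handoff contents; raise ValueError if the contract is not met."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"handoff missing keys: {sorted(missing)}")
    return data


def read_handoff(path: Path) -> dict:
    """Orchestrator side: read and validate the file, never modify it."""
    return parse_handoff(path.read_text(encoding="utf-8"))
```

If validation fails, the orchestrator knows the subagent didn’t do its job without having to interpret the output, which is exactly the second question in the check above.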

Even then, most of the work each agent does is itself single-agent. The multi-agent layer is the smallest part of the system. The bulk of the code is one agent doing one job at a time, with the same care a single-agent architecture would get.

The default is one. Don’t promote to many until the failure mode justifies it.