· claude-code · ai-agents · pm-workflow

Autobots: An Agent Harness

Anthropic published the Planner/Generator/Evaluator pattern for long-running app development. My harness inherits the pattern and adds a Plan-Critic, an adjudicating Lead, and the engagement layer that keeps me looped in.

Two robot characters at terminals on either side of a shared UI panel, editorial illustration.

Anthropic published a post earlier this year on harness design for long-running app development. Their pattern is three agents in a loop. A Planner expands intent into a spec. A Generator implements features incrementally. An Evaluator tests the running app and grades the work against defined criteria. The point is that agents tend to praise their own work, so the Evaluator role exists outside the Generator. The Generator gets a concrete thing to iterate against rather than its own approval.

That post was the starting point for mine. The reason I needed something like it is that I’m not an engineer. When a Claude Code session finishes and tells me the work is done, I cannot read the diff and decide whether it’s right. I can read the symptoms when it breaks. I cannot read the code and predict whether it will.

Which makes a single-agent loop a non-starter. The agent that wrote the code is also the agent that decides it’s done. If the agent says “tests pass, shipped,” I have no leverage to push back. I tried this anyway for a few months when I first started shipping agent-built work last December, and I ate the consequences. The work passed on the first read and broke in week two. Coverage was high; verification was hollow. The agent would “verify” by running tests on the same fixture it had just written. Tests passed. Real product didn’t.

I needed a second agent in the loop because I was never going to be the second agent in the loop.

My harness is that pattern, with additions smaller than the pattern itself. A Plan-Critic that grades the Planner’s spec the way the Evaluator grades the code. A Lead agent that orchestrates the loop and adjudicates when the others push back on each other. An engagement layer of background watchers that keeps me looped in without forcing me to babysit. An append-only review trail per task that ships with the PR.

The default verdict on a disagreement goes to the Evaluator. The bias I’m trying to counter is the writer’s bias toward being done.
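To pin the shape down, here's a minimal sketch of the roster and that default-verdict rule. The names and types are mine, illustrative only, not code from the harness.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Role(Enum):
    PLANNER = auto()      # drafts the design spec and the implementation plan
    PLAN_CRITIC = auto()  # grades the spec the way the Evaluator grades code
    GENERATOR = auto()    # implements one task at a time under TDD
    EVALUATOR = auto()    # tests the running work against a rubric
    LEAD = auto()         # orchestrates the loop and adjudicates pushback

@dataclass
class Verdict:
    role: Role
    accept: bool
    notes: str

def resolve(writer: Verdict, grader: Verdict) -> Verdict:
    # On a disagreement the grader wins by default. The bias being
    # countered is the writer's bias toward being done, so standoffs
    # resolve toward the Evaluator or Plan-Critic, never the author.
    return grader if writer.accept != grader.accept else writer
```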

The execution loop

The execution side spawns two long-lived agents per worktree. The Generator reads a plan file, picks up one task at a time, and implements it under TDD. The Evaluator reads the same plan plus a rubric, and reviews each task as the Generator finishes it.

The Generator owns the code. The Evaluator owns the verdict. Neither one writes to the other’s territory. When the Evaluator finds something, it appends feedback to the task’s review file. When the Generator pushes back, it appends pushback to the same file. The trail is append-only on purpose. No edits in place, no compressed summaries, no “we agreed” that hides a disagreement.

Most tasks close after one round of feedback. Some take three or four. A few hit an iteration cap and escalate to me. The Evaluator wins the default verdict so the loop has a bias toward thoroughness, not throughput.
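A sketch of one task's trip through that loop, under the same caveat: `generator`, `evaluator`, and `trail` are hypothetical handles, and the cap is illustrative.

```python
MAX_ROUNDS = 4  # illustrative iteration cap; past it, the task escalates to me

def run_task(task, plan, rubric, generator, evaluator, trail):
    for _ in range(MAX_ROUNDS):
        handoff = generator.implement(task, plan)   # TDD, one task at a time
        trail.append("Generator handoff", handoff)  # append-only, never edited
        verdict = evaluator.review(task, plan, rubric)
        trail.append("Evaluator findings", verdict.notes)
        if verdict.accept:
            trail.append("Final acceptance", verdict.notes)
            return "accepted"
        task.feedback = verdict.notes  # the next round starts from the findings
    return "escalated"  # cap hit: the rejection stands until I adjudicate
```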

The trail is the artifact

Every task gets one markdown file under .agent-docs/review/. The Generator’s handoff goes in. The Evaluator’s findings go in. Pushback, adjudication, and final acceptance all go in. By the time a task closes, the file reads like a code review thread. Claim, counter, evidence, verdict.
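A sketch of what append-only means at the file level. The per-task filename and the section format are my guesses; the post only fixes the directory.

```python
from datetime import datetime, timezone
from pathlib import Path

REVIEW_DIR = Path(".agent-docs/review")

def append_entry(task_id: str, author: str, heading: str, body: str) -> None:
    # Open in append mode only: no edits in place, no compressed
    # summaries, so a disagreement stays visible in the final file.
    path = REVIEW_DIR / f"{task_id}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n## {heading} ({author}, {stamp})\n\n{body}\n")
```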

That file ships with the PR. The reviewer (me, or whoever else) opens the PR and reads not just the diff but the conversation that produced it. Most of the time the read is fast. The task did what the plan said, the Evaluator signed off, done. When something looks off, the trail is already there to answer “why this way.” No reconstruction.

This is the part I underestimated when I started. The trail isn’t a debug artifact for me. It’s the actual deliverable, sitting next to the code. The code is the surface. The trail is the proof of how the surface was reached.

Planning is its own adversarial loop

Standard mode runs the Generator and Evaluator pair against a plan I wrote with the agent in-session. Autonomous mode is the version where I walk away. The harness spins up a second pair upstream. A Planner that drafts the design spec and the implementation plan, and a Plan-Critic that grades both. Same loop, two layers up.

Same pattern. The Planner writes, the Plan-Critic grades, the Lead adjudicates when they can’t resolve a disagreement on their own. The Plan-Critic’s verdict is the default on a tie, and the Lead pulls me in only when a call needs human judgment. The Planner and Plan-Critic stay in the team after the plan is accepted, dormant, so I can ask for a v2 cycle later without rebuilding anything.
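What "dormant" buys, sketched. `team.wake` and the member names are hypothetical; the point is that a revision costs a wake call, not a team rebuild.

```python
def request_v2(team, change_request: str):
    planner = team.wake("planner")     # still in the team, just asleep
    critic = team.wake("plan-critic")
    draft = planner.revise(change_request)
    return critic.grade(draft)         # same loop, same default verdict
```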

Here’s the honest version of how this part works. The planning loop is designed to outsource the thinking. That’s its whole purpose. I built it, and when I actually try to use it, I come back to specs that don’t match what I wanted. The Plan-Critic catches structural issues, but it doesn’t catch “this isn’t the thing I was going for,” because that’s not a verdict a critic can grade against without me in the room. The current refrain about agent-assisted work is “don’t outsource the thinking,” and it has held true for me in this layer. Better context for the Planner might close that gap eventually. Right now it doesn’t work the way I want.

The engagement layer is half the code

The part of the harness that took me longest to get right was not the adversarial loop itself. It was the surrounding infrastructure that kept me looped in without forcing me to babysit. Four background watchers run for the life of a session (sketches of two follow the list):

  • A commit watcher polls git and fires when a task is accepted.
  • An inbox watcher polls the team’s message inbox and surfaces every message a teammate sends to me.
  • A stuck-tool watcher pairs every tool call to its result and fires when a result never comes back.
  • A stream-log tail surfaces teammate tool activity in a terminal view so I can watch live.
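Here is the first watcher, sketched as plain git polling. The poll interval is illustrative, and the check that a new commit is actually an acceptance (a commit-message convention, say) is elided.

```python
import subprocess
import time

def watch_commits(worktree: str, on_new_commit, poll_s: float = 5.0) -> None:
    # Poll HEAD; when it moves, surface the commit instead of letting
    # acceptance commits pile up silently while I sit idle.
    seen = None
    while True:
        head = subprocess.run(
            ["git", "-C", worktree, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if seen is not None and head != seen:
            on_new_commit(head)
        seen = head
        time.sleep(poll_s)
```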

Each one exists because of a specific failure mode I hit. The commit watcher exists because message delivery to the Lead is unreliable and acceptance commits would silently pile up while I sat idle. The stuck-tool watcher exists because the first version of the harness lost a Generator to a permission prompt I couldn’t see, sitting on an edit to a protected config path that was hard-blocked even under the bypass-permissions mode I’d granted. The agent’s session stayed alive and produced no further output. I had no signal it was dead until I checked manually.
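The stuck-tool watcher's core is just bookkeeping: pair every call id with a start time, clear it on a result, alarm when a pair stays open too long. The event hooks and the threshold here are my own illustration.

```python
import time

class StuckToolWatcher:
    def __init__(self, alert, threshold_s: float = 120.0):
        self.alert = alert                   # callback that surfaces the stall to me
        self.threshold_s = threshold_s
        self.pending: dict[str, float] = {}  # call id -> time the call was seen

    def on_tool_call(self, call_id: str) -> None:
        self.pending[call_id] = time.monotonic()

    def on_tool_result(self, call_id: str) -> None:
        self.pending.pop(call_id, None)      # the result came back; stop tracking

    def sweep(self) -> None:
        # Run periodically. A call with no result past the threshold is
        # the signature of an invisible permission prompt or a dead agent.
        now = time.monotonic()
        for call_id, started in list(self.pending.items()):
            if now - started > self.threshold_s:
                self.pending.pop(call_id)
                self.alert(call_id)
```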

The harness doesn’t fix those problems. The runtime fixes those problems. What the harness does is make them visible inside thirty seconds instead of three hours. For the AI agents I ship, observability is the quality bar that comes before correctness. I keep finding new reasons that’s true.

Lessons from a single-team lifecycle

The first version of the harness ran two teams in autonomous mode. A planning team handed off to an execution team after the plan was accepted. The handoff broke when the runtime got stuck cleaning up the planning team and there was no clean recovery path mid-run.

The fix was a single team for the entire run. Planning agents stay in the team as dormant members after the plan ships, alongside the execution agents. One team-cleanup call at the end, and if that one hangs, leave it. The worktree is done.
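The single-team lifecycle, sketched with hypothetical phase and team calls. The one structural commitment is that cleanup happens once, at the end, and a hang there is tolerated rather than recovered from.

```python
def run(spawn_team, plan_phase, execute_phase, worktree, intent):
    team = spawn_team(worktree)          # all five agents, one team, one run
    try:
        plan = plan_phase(team, intent)  # Planner/Plan-Critic loop, then dormant
        execute_phase(team, plan)        # Generator/Evaluator loop per task
    finally:
        try:
            team.cleanup(timeout_s=60)   # the only cleanup call in the run
        except TimeoutError:
            pass  # if it hangs, leave it; the worktree is done either way
```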

The original design was elegant and the failure mode it created was non-recoverable. The new design is uglier (five agents in one team, three of them sleeping most of the time) and the failure mode is “the team object lives forever and gets garbage-collected later.” That trade seemed worth it.

Summary

  • Adversarial review separates writing from grading. Generator writes, Evaluator grades, I adjudicate. The Evaluator’s default wins because the bias I’m countering is the writer’s bias toward being done.
  • The trail is the artifact. Append-only markdown files per task ship with the PR. The trail is the proof of how the diff was reached.
  • Planning runs the same loop, and it’s the part that doesn’t work yet. Autonomous mode adds an upstream Planner and Plan-Critic. Walking away in the early stages hasn’t produced specs that match what I wanted.
  • Half the code is the engagement layer. Watchers, observability, and the stream-log tail exist because the failure modes I care about are silent without them.
  • Single team, single delete. The dual-team design was elegant and broke. The single-team design is uglier and holds.

The harness isn’t a tool I use in every session. For one-off edits, the full setup is overkill. A single agent is fine, and that’s still where most of my work lives. The harness is the gear I shift into when I want code I can ship without a second pass from me. That’s the bar that wasn’t reachable on a single agent. That’s the bar I built it to clear.