
Two Patterns That Changed How I Think About Multi-Agent Systems

This post covers two structural patterns for multi-agent systems: why agents can’t evaluate their own work (and what to do about it), and when to actively prune your harness as models improve.

"Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work."
-- Anthropic Labs, Harness design for long-running application development (Mar 2026)

This is a companion to my multi-agent guidelines reference. That post covers the what — how to build multi-agent systems with Claude Code. This one covers two patterns from Anthropic’s most recent harness research that change how you should think about designing them. If you haven’t read the guidelines post, it isn’t required, but the structural recommendations here build on the foundations laid there.

Anthropic published “Harness design for long-running application development” (March 24, 2026) by Prithvi Rajasekaran from their Labs team. It builds on two earlier pieces: “Effective harnesses for long-running agents” (November 2025, by Justin Young), which introduced the initializer-agent/coding-agent structure and context resets via structured handoff artifacts, and “Effective context engineering for AI agents” (September 2025), which established the context management principles that underpin everything else. The March 2026 article adds the generator-evaluator pattern, sprint-based decomposition, and — critically — the lessons from stripping those constructs back out as models improved. Together, the three articles form a progression: context fundamentals, then harness design, then harness evolution.

One important clarification before diving in: the harness article uses the Claude Agent SDK, not the Claude Code CLI. The Agent SDK is the underlying framework that powers Claude Code — same tools, same agent loop, same context management, but programmable in Python and TypeScript with more control over orchestration. Claude Code CLI is built on top of it. The patterns described here transfer; the implementation details don’t always map 1:1. I’ll flag the differences where they matter.

Critically, the Agent SDK no longer reads filesystem settings (CLAUDE.md, settings.json) by default — you opt in with settingSources. Auto memory (~/.claude/projects/<project>/memory/) is CLI-only and never loaded by the SDK. So when the harness article describes file-based communication between agents, they’re working at a layer below what most Claude Code CLI users touch directly. Source: Agent SDK migration guide, Claude Code features in SDK.


Pattern 1: Agents Can’t Evaluate Their Own Work

This is the most substantive finding in the harness article, and the one with the strongest academic backing. The short version: when you ask an agent to evaluate work it produced, it will reliably grade itself too generously. The fix is structural — separate the generator from the evaluator — and the Anthropic team found this separation is more tractable than trying to make the generator self-critical.

Agents grade their own work too generously

The Anthropic team observed that when asked to evaluate work they’ve produced, agents tend to confidently praise it — even when quality is obviously mediocre to a human reviewer. The earlier harness post (November 2025) documented a related version of this: agents marking features as “done” without proper end-to-end testing, or prematurely declaring the whole project complete when significant gaps remained. The March 2026 article sharpens this into a distinct, named problem: self-evaluation leniency. The generator isn’t lying. It genuinely believes its output is good. The bias is systematic, not adversarial.

This is worse for subjective tasks than for objective ones. For code, you can at least run tests — there’s a binary signal. For design quality, layout polish, UX flow, marketing copy, or anything where “good” is a judgment call, agents reliably skew positive when grading their own work. This matters for anyone building content pipelines, marketing automation, or any system with subjective quality gates that can’t be reduced to a test suite.

Research confirms the bias is systematic

This isn’t just one team’s observation. The self-evaluation bias has been studied directly:

NeurIPS 2024 (Panickssery et al.): LLMs recognize and favor their own outputs. The study found a linear correlation between self-recognition capability and self-preference bias strength — the better a model is at identifying its own text, the more it prefers it. This is academic research with peer review, not an engineering blog post.

ICLR 2026 submission (“Do LLM Evaluators Prefer Themselves for a Reason?”): Stronger models exhibit more pronounced harmful self-preference when they err. In other words, the models that are best at evaluation in general are also the ones that struggle most to recognize when their own outputs are wrong. This is counterintuitive — tuning for better evaluation may actually worsen self-assessment.

Self-correction blind spot research (Tsui, 2025): Across 14 models, LLMs are far more likely to spot errors in external input than in their own outputs — a 64.5% detection rate for external errors versus significantly lower rates for self-generated errors. The mechanism is consistent: models apply different standards depending on whether they produced the text under review.

Separation is more tractable than self-criticism

The Anthropic Labs team found that tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work. This is the key practical insight. Once external feedback exists, the generator has something concrete to iterate against — specific deficiencies to address rather than a vague instruction to “be more critical.”

This maps to the evaluator-optimizer workflow in Anthropic’s “Building Effective Agents” (which recommends this pattern for literary translation and complex search). But the harness article makes a specific tractability claim that goes further: it’s easier to calibrate an external judge toward harshness than to make a creator self-critical. The academic self-correction blind spot data (64.5% external vs. self error detection) supports the mechanism — the asymmetry is baked into how these models process their own outputs. Confidence level: the Anthropic Labs team reports this is more tractable; the academic data supports the mechanism; I haven’t seen contradicting evidence, but this is still one team’s experience extrapolated.
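The control flow is easy to sketch. Below is a minimal Python illustration of the separation — generate and evaluate are hypothetical stand-ins for real model calls (the scoring logic here is fake), but the shape is the point: the evaluator runs as its own step, and the generator iterates against a concrete deficiency list rather than a vague instruction to self-criticize.

```python
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    score: int                       # 1-10, anchored by few-shot examples
    deficiencies: list = field(default_factory=list)

def generate(task, feedback=()):
    """Hypothetical generator call: produces a fresh attempt, revised
    against concrete deficiencies from the evaluator."""
    return {"task": task, "revision": len(feedback)}

def evaluate(artifact):
    """Hypothetical evaluator call: in a real harness this runs with
    its own skeptical prompt and context. The stub rewards revisions."""
    score = min(10, 4 + 3 * artifact["revision"])
    gaps = [] if score >= 8 else [f"gap found in revision {artifact['revision']}"]
    return Evaluation(score=score, deficiencies=gaps)

def run(task, threshold=8, max_rounds=3):
    history = []  # accumulated deficiencies: something concrete to iterate against
    for _ in range(max_rounds):
        artifact = generate(task, history)
        result = evaluate(artifact)
        if result.score >= threshold:
            break
        history.extend(result.deficiencies)
    return artifact, result
```

In a real harness, evaluate would be a separate agent session with its own system prompt; the stub only demonstrates the loop structure.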

Calibrating the evaluator

Out of the box, Claude is also a lenient evaluator of other LLM-generated work, not just its own. Self-preference bias is part of it, but there’s also a general tendency toward diplomatic assessment. The harness team’s calibration approach: read evaluator logs, find examples where the evaluator’s judgment diverged from a human reviewer’s, and update the evaluator’s prompt to correct for those specific divergences. They also used few-shot examples with detailed score breakdowns to anchor grading — showing the evaluator what a 3/10 actually looks like versus a 7/10, with concrete reasoning.

Academic research corroborates one mitigation: chain-of-thought reasoning before evaluation reduces self-preference bias (ICLR 2026 submission). Forcing the evaluator to articulate its reasoning before assigning a score creates a check against reflexive generosity. The harness team’s methodology — iterative prompt tuning based on observed divergence from human judgment — echoes the approach described in OpenAI’s self-evolving agents cookbook. Label this as: one team’s approach with external corroboration, not a proven universal methodology.
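As a concrete illustration of that calibration style — few-shot score anchors plus reasoning-before-score — here is a hedged sketch. The anchor texts and rubric wording are invented, not the harness team’s actual prompts:

```python
# Illustrative few-shot anchors: each shows what a given score looks
# like, with concrete reasoning. These examples are invented.
SCORE_ANCHORS = [
    (3, "Layout renders, but two of five buttons do nothing and the "
        "form loses state on navigation. Core flow is unusable."),
    (7, "All flows work end to end; spacing is inconsistent on the "
        "settings page and one error message is unclear."),
]

def build_evaluator_prompt(rubric: str) -> str:
    lines = [
        "You are a skeptical evaluator. Look for gaps; do not confirm completeness.",
        f"Rubric: {rubric}",
        "Calibration examples:",
    ]
    for score, reasoning in SCORE_ANCHORS:
        lines.append(f"- Score {score}/10: {reasoning}")
    # Chain-of-thought before scoring: articulate the reasoning first,
    # then assign the number -- a check against reflexive generosity.
    lines.append("First list every deficiency you find, with evidence.")
    lines.append("Only then assign a 1-10 score consistent with the examples.")
    return "\n".join(lines)
```

The ordering matters: deficiencies before score, so the number has to be consistent with findings already on the record.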

Interactive evaluation with Playwright

The evaluator wasn’t just reading code — it used Playwright MCP to navigate, click through, and interact with the live running application before scoring. This is the same pattern from the earlier harness post (which used Puppeteer MCP). Both posts confirmed that browser-based testing dramatically improved bug detection over code-only review. An evaluator that reads source code misses rendering bugs, broken interactions, and layout issues that only surface in the running application. The combination of code review and interactive testing catches categories of defect that neither approach finds alone.

What this means for Claude Code CLI and Cowork users

For Cowork users (Anthropic’s scheduled-tasks interface for Claude Code): consider a dedicated QA scheduled task that reviews the output of your production tasks rather than relying on self-assessment within the same session.

In my guidelines post, I recommended a “quality review skill” in Section 7 — but I didn’t address who does the evaluation or how to calibrate it. The takeaway here is structural: if you have a quality gate in your multi-agent system, don’t let the agent that did the work also grade it. Create a separate subagent for evaluation with its own system prompt tuned toward skepticism. Give it explicit instructions to look for gaps, not to confirm completeness.

The evaluator should have different instructions, a different disposition, and ideally different context than the generator. The whole point is that the same context that produced the work also produced the blind spots.
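In Claude Code CLI terms, one way to set this up is a project subagent, following the CLI’s convention of a markdown file with YAML frontmatter under .claude/agents/. The name, instructions, and tool list below are illustrative, not a recommended canonical config:

```markdown
---
name: quality-evaluator
description: Reviews completed work for gaps. Use after a build task finishes, never during it.
tools: Read, Grep, Glob, Bash
---

You are a skeptical evaluator. You did not produce the work you are
reviewing, and you must not fix it — only grade it.

- List every deficiency you find, with file and line evidence.
- Do not confirm completeness; look for stubbed features, display-only
  UI, and untested paths.
- Reason through the deficiencies first, then assign a 1-10 score.
```

Because subagents run with their own context window and system prompt, this gets you the structural separation even inside a single CLI session.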


Pattern 2: Your Harness Is Not Permanent Architecture

The first pattern is about adding the right complexity — a separated evaluator that catches what generators miss. This second pattern is the inverse: removing complexity that’s no longer needed. Together they form a discipline: add structure where models fail, remove it where models have improved.

The principle

Every component in a harness encodes an assumption about what the model can’t do on its own. Context resets assume the model can’t maintain coherence over long sessions. Sprint decomposition assumes the model can’t plan and execute multi-feature work without forced structure. Evaluator agents assume the model can’t assess its own quality. These assumptions are worth stress-testing because they may be incorrect today, and they can quickly go stale as models improve.

This extends Anthropic’s “Building Effective Agents” guidance — “find the simplest solution possible, and only increase complexity when needed” — with an inverse discipline: actively remove complexity when the model has evolved past the assumption that justified it. The original guidance tells you to start simple. This extension tells you to return to simple when circumstances allow.

Context anxiety

Both Anthropic and Cognition AI (in their Devin rebuild) documented a phenomenon they call “context anxiety.” Models — specifically Sonnet 4.5 — begin wrapping up work prematurely as they approach what they believe is their context limit, even when they have plenty of room. The symptom was distinctive: Sonnet 4.5 would start writing SUMMARY.md and CHANGELOG.md files to externalize state, underestimating its remaining token budget. It was preparing for a shutdown that wasn’t coming.

Compaction (summarizing the conversation in-place) wasn’t sufficient to fix this because the model still felt like it was near the edge. The earlier harness post’s solution was context resets: killing the agent entirely and starting fresh with a structured handoff artifact that captured the work completed so far. The new session started with full context budget and no anxiety. This worked, but it added significant complexity to the harness — handoff artifact design, state serialization, restart logic.
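The article doesn’t publish the artifact format, but the shape is easy to sketch. A hypothetical handoff might serialize completed work, binding decisions, and next steps, then seed the fresh session’s opening prompt — all field names below are my own invention:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    """Illustrative handoff artifact: what a dying session serializes
    so a fresh session (full context budget, no anxiety) can resume."""
    completed: list = field(default_factory=list)   # features finished so far
    decisions: list = field(default_factory=list)   # choices the next session must respect
    next_steps: list = field(default_factory=list)  # where to pick up

def write_handoff(handoff: Handoff, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(handoff), f, indent=2)

def resume_prompt(path: str) -> str:
    """Opening prompt for the fresh session, built from the artifact."""
    with open(path) as f:
        state = json.load(f)
    return (
        "Resume the build. Completed: " + "; ".join(state["completed"])
        + ". Respect these decisions: " + "; ".join(state["decisions"])
        + ". Next: " + "; ".join(state["next_steps"])
    )
```

Even this toy version hints at the complexity cost: artifact schema, serialization, and restart logic all become harness code you have to maintain.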

Opus 4.6 eliminates context resets

The Opus 4.6 launch blog confirmed the model “plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases.” The harness team found Opus 4.6 largely eliminated context anxiety. Where Sonnet 4.5 would start externalizing state and wrapping up prematurely, Opus 4.6 maintained coherent focus through extended sessions. One continuous session ran coherently for over two hours.

The team dropped context resets entirely and relied on the Agent SDK’s automatic compaction. The compaction docs confirm the mechanism: it triggers automatically when input tokens exceed a configurable threshold (default 150K), generates a summary, and continues with compressed context. Currently in beta, supported on Opus 4.6 and Sonnet 4.6. The infrastructure that existed to work around context anxiety — handoff artifacts, restart logic, state serialization — became dead weight.
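Mechanically, the compaction trigger is simple to picture. This is an illustrative reimplementation of the documented behavior, not the SDK’s actual code — count_tokens and summarize stand in for model-side calls:

```python
COMPACTION_THRESHOLD = 150_000  # default trigger per the compaction docs

def maybe_compact(messages, count_tokens, summarize,
                  threshold=COMPACTION_THRESHOLD):
    """Sketch of automatic compaction: when input tokens exceed the
    threshold, replace the history with a generated summary and keep
    going in the same session -- no restart, no handoff artifact."""
    if count_tokens(messages) <= threshold:
        return messages  # under budget: no-op
    summary = summarize(messages)
    return [{"role": "user", "content": f"[compacted history] {summary}"}]
```

The contrast with the handoff-artifact approach is the point: compaction keeps the session alive with compressed context, while a reset kills it and starts over.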

Sprints were removed

The earlier harness used a sprint construct: one feature at a time, with negotiated sprint contracts between generator and evaluator defining scope, acceptance criteria, and completion conditions. This decomposition was necessary because the prior-generation models couldn’t maintain focus across multi-feature work. They would lose track of earlier decisions, repeat themselves, or drift off-spec.

With Opus 4.6, the generator could handle longer coherent work without forced decomposition. The team dropped sprints entirely. The evaluator moved from per-sprint grading to a single evaluation pass at the end. Less overhead, fewer coordination points, simpler harness.

But the evaluator stayed

This is the key nuance, and it connects the two patterns. The evaluator’s usefulness depends on where the task sits relative to what the model can handle reliably solo. On Opus 4.5, that boundary was close — most non-trivial builds needed the evaluator to catch quality gaps. On Opus 4.6, the boundary moved outward. Tasks that previously needed the evaluator were now within the generator’s solo capability.

But for tasks still at the edge of the model’s reliable solo capability, the evaluator continued to give real lift. It’s not a fixed yes-or-no decision. It’s worth the cost when the task exceeds what the current model handles well alone, and it’s overhead when it doesn’t.

Concrete results

The harness article includes a cost and duration comparison that makes this tangible:

Harness                         Prompt                   Duration   Cost
Solo agent (Opus 4.5)           Retro game maker         20 min     $9
Full harness (Opus 4.5)         Retro game maker         6 hr       $200
Simplified harness (Opus 4.6)   DAW (music production)   ~4 hr      $125

The solo run produced a broken core feature — gameplay didn’t work. The full harness on the same prompt produced a working, playable game. The simplified harness on a more complex prompt (a digital audio workstation) still caught real gaps via the evaluator: stubbed features that showed UI but had no implementation, display-only functionality that didn’t respond to input. The evaluator earned its cost even after the harness was simplified.

What this means for Claude Code CLI and Cowork users

If you built your multi-agent system three months ago, go back and audit it. The sprint decomposition you designed for earlier models may be unnecessary overhead on Opus 4.6. The context resets you engineered around may be solved by compaction. But don’t remove your evaluator just because the model got better — check whether your tasks are within or beyond the model’s reliable solo capability first.

The meta-discipline: every time a new model drops, re-examine your harness. Strip away the pieces that are no longer load-bearing. Redirect that engineering effort toward pushing capabilities the model still can’t handle alone. Your harness should be tight against the current frontier, not frozen at a past one.


Closing

The harness article’s parting observation: “The space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves.” The job isn’t to build a permanent architecture — it’s to keep your harness at the moving frontier between what the model handles and what it doesn’t.

My multi-agent guidelines post covers the structural foundations. This post adds two operational disciplines: (1) never let the worker grade its own work, and (2) periodically audit your harness for assumptions the model has outgrown. Both apply whether you’re building with the Agent SDK, Claude Code CLI headless mode, or Cowork scheduled tasks. The execution layer differs. The principles don’t.


Verification Scorecard (11 claims audited against primary sources)
  1. Self-evaluation bias — CONFIRMED. NeurIPS 2024 (Panickssery et al.) demonstrates linear correlation between self-recognition and self-preference. ICLR 2026 preprint confirms stronger models show more pronounced harmful self-preference when they err. Anthropic harness article documents the phenomenon in practice.

  2. Self-correction blind spot (64.5%) — CONFIRMED. Tsui 2025 survey across 14 models. Consistent with emergentmind.com aggregation of self-correction research.

  3. Separation more tractable than self-criticism — MEDIUM-HIGH. Anthropic Labs team reports this based on their experience building the harness. Academic mechanism support (differential error detection rates for self vs. external). No contradicting evidence found, but limited to one team’s systematic report.

  4. Context anxiety — CONFIRMED. Documented in both Anthropic harness posts (November 2025 and March 2026). Independently documented by Cognition AI in their Devin rebuild blog.

  5. Compaction insufficient for Sonnet 4.5 — MEDIUM-HIGH. Harness article describes compaction as insufficient because the model still perceived itself as near limits. Earlier harness post corroborates with context reset solution. Limited to Anthropic’s internal testing.

  6. Opus 4.6 reduced need for context resets — CONFIRMED. Opus 4.6 launch blog confirms improved sustained agentic performance. Harness article confirms context resets were dropped. Secondary coverage corroborates.

  7. Harness pruning discipline — CONFIRMED for the base principle (“start simple, add complexity only when needed” from Building Effective Agents). MEDIUM for the active-pruning extension (harness article is primary source; the inverse discipline of actively removing complexity is novel).

  8. Sprint contracts pattern — LOW confidence. Single source (harness article). The pattern itself is described clearly, but only one team’s experience. No independent corroboration found.

  9. Evaluator calibration methodology — MEDIUM. Harness article describes the iterative prompt-tuning approach. OpenAI’s self-evolving agents cookbook describes a similar pattern. Chain-of-thought mitigation supported by ICLR 2026 submission. No rigorous comparison of calibration approaches.

  10. GAN-inspired framing (generator + evaluator) — MEDIUM. The structural analogy to adversarial training is well-known. The specific combination (Playwright-based interactive evaluation + separate agent contexts + weighted rubrics) is novel to the harness article.

  11. Agent SDK is not CLI — CONFIRMED. Official Agent SDK docs, migration guide, and Claude Code features page all document the distinction. SDK does not load CLAUDE.md or settings.json by default; CLI does.


Changelog

  • v1.0 (Mar 2026): Initial publication. 11 claims audited against primary sources.

I'm an independent engineer (ex-eBay) who designs and builds production AI systems. I work deep in the Claude Code and MCP ecosystem, document what I find, and take on contract work. Currently taking on projects. Get in touch.