
Adversarial Subagent Teams: A Retrospective

71 deploys of the adversarial subagent team pattern over 40 days, drawn from project logs across two corpora and validated against the actual PRs, commits, and arXiv references. One subagent-reported finding got rejected when its citation didn’t survive re-checking.

The pattern itself is a dispatch shape: one or more Claude subagents are told to break, not validate, a piece of work that another subagent (or the main agent) just produced. The reviewer is asked for problems, code-line-specific and falsifiable; a separate subagent then triages and fixes.

Post 21 surveyed the published research on agent self-evaluation bias and the patterns Anthropic Labs uses to work around it. This post is the empirical flip side of that survey, with a falsification trail run against every cited fact before publish.

Verification Scorecard

| # | Claim | Verdict |
|---|-------|---------|
| 1 | Deploys correlate with subsequent in-session actions | CONFIRMED: 65 of 71 deploys followed by action (smoothed confidence 0.89) |
| 2 | A fabricated arXiv citation got caught by an adversarial reviewer | CONFIRMED: arXiv:2212.08459 authors verified externally |
| 3 | A subagent-reported path-traversal catch in chat-arch PR #52 | REJECTED: the cited PR is a workshop-bundle feature; the actual 2026-05-08 adversarial-review fix (ac86e49) is TOCTOU race + atomic ledger, not path traversal |
| 4 | Three commits in the same project trace to adversarial review findings | CONFIRMED: all 3 commit hashes verified with cited messages |
| 5 | Convergence is fast and bounded | CONFIRMED with caveats: median 2 iterations, mean 2.41 over n=71; "iteration" counting documented in §6 |
| 6 | Paired (optimistic+adversarial) outperforms solo adversarial | NOT SUPPORTED: permutation p=0.169 with n_paired=9 |

Every cell above is independently verifiable against the cited source (arXiv page, git commits, GitHub PR, the underlying 5-tuple extractions).

1. The methodology

A retrospective like this can produce confident-and-clean output that contains errors a falsifier-agent would catch in a single Read. The methodology uses five hard rules from a custom pattern-retrospective skill I wrote after an earlier handoff document failed exactly this check:

  1. Audit any external system before specifying it. Earlier handoff work to a sibling system, chat-arch, failed this; about 60% of its proposed “requirements” turned out to already exist in chat-arch. The methodology used here verifies every “X system does Y” claim against the actual source before relying on it.
  2. Streaming parsers only. No whole-file loads. A first-pass extraction via in-context Explore agents hit context-window overflow on 11 of 12 Cowork sessions because subagent transcripts run 20 KB to 11.9 MB each. The streaming JSONL parser used here handles any size by reading line-by-line.
  3. Verify before citing. Every commit hash, PR number, URL, and file:line reference gets an independent verification step recorded in a falsification trail document.
  4. Confidence smoothing with a skeptical prior. No “X always happens” from n=1. confidence = supporting / (supporting + contradicting + 2). The +2 in the denominator is a skeptical pseudo-count: it pulls small-n estimates away from 1.0 and biases low against overclaim. (It is not the Laplace rule of succession, which uses (s+1)/(s+c+2); this formula deliberately omits the +1 in the numerator so that a single supporting observation maps to 0.33, not the 0.67 that Laplace would give.)
  5. Provenance tags inline. Every load-bearing claim carries [deterministic] / [falsifier-verified] / [inferred-from-action] near the number.
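
The smoothing formula in rule 4 is small enough to write out directly. This is a minimal sketch, not the skill's actual code; the function name and the `prior` parameter are assumptions introduced here for illustration.

```python
def smoothed_confidence(supporting: int, contradicting: int, prior: int = 2) -> float:
    """Skeptical-prior confidence: supporting / (supporting + contradicting + prior).

    `prior` is the pseudo-count added to the denominator only (2 in this post).
    """
    if supporting < 0 or contradicting < 0:
        raise ValueError("counts must be non-negative")
    if prior <= 0:
        raise ValueError("prior must be positive so the denominator is never zero")
    return supporting / (supporting + contradicting + prior)


if __name__ == "__main__":
    # (supporting, contradicting) pairs that appear in this post
    for s, c in [(1, 0), (3, 0), (65, 6)]:
        print(f"supporting={s} contradicting={c} -> {smoothed_confidence(s, c):.2f}")
```

With the counts used in this post, one supporting observation gives 0.33, three verified commits give 0.60, and 65 supporting against 6 contradicting gives 0.89.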

2. Inventory

71 distinct user-typed deploys of the adversarial-review-team pattern over 40 days (2026-04-13 to 2026-05-22), across both Claude transcript corpora (Claude Code CLI under ~/.claude/projects/, Cowork local-agent-mode under %AppData%\Claude\local-agent-mode-sessions\).

| Detail | Value |
|--------|-------|
| Files scanned (streaming) | 3,525 JSONL across both corpora |
| Raw candidate hits | 318 (containing “adversarial” / “red team” / “skeptic”) |
| Classification drops | 9 named false-positive classes (subprocess noise, scheduled-task wrappers, session continuations, etc.) |
| Deploys kept | 143 raw, deduped to 71 per skill §2 priority (queue-enqueue > cowork-audit > CLI) |
| Variant phrasings observed | 13, re-derived from corpus |
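
The first two rows (streaming scan, then keyword candidates) can be sketched roughly as follows. This is an illustration under assumptions, not the scanner used for the study: the function names are hypothetical, and matching keywords against the whole serialized record is a simplification of whatever field-level matching the real scanner does.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

# Keywords taken from the "Raw candidate hits" row above.
KEYWORDS = ("adversarial", "red team", "skeptic")


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per line without loading the whole file."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or corrupt lines instead of aborting
            if isinstance(record, dict):
                yield record


def candidate_hits(root: Path) -> Iterator[Tuple[Path, dict]]:
    """Yield (file, record) for every record mentioning one of the keywords."""
    for path in sorted(root.rglob("*.jsonl")):
        for record in iter_records(path):
            text = json.dumps(record, ensure_ascii=False).lower()
            if any(keyword in text for keyword in KEYWORDS):
                yield path, record
```

Because both functions are generators, memory use is bounded by the largest single line rather than the largest file, which is the property rule 2 of the methodology asks for.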

The earliest “adversarial review TEAM” deploy is 2026-04-13. Earlier March 2026 hits are a different pattern (single-persona “adversarial buyer”), excluded from this analysis.

3. Finding #1: deploys correlate with subsequent actions

[deterministic] Across 71 deduped invocations, 65 (~92%) show ≥1 subsequent Edit / Write / Bash-with-git-or-gh tool_use action in the parent file’s window. Smoothed confidence: 0.89 (supporting 65, contradicting 6, prior 2).

One important caveat up front: “actions followed” captures Edit / Write / git tool_uses, not shipping. It does not confirm a commit landed in main or a PR merged. The 0.89 number describes in-session activity after the dispatch, not downstream production impact.

This is a clean deterministic signal because the underlying tool_use events are timestamped records in the JSONL, not a classifier output. It does not prove that adversarial framing caused the actions. It shows that deploys of this pattern are followed by activity ~92% of the time. The 6 “no_actions_observed” cases are mostly Cowork queue-enqueue prompts whose subagents fire in a sibling CLI subprocess transcript not opened in this pass, so the 65/71 likely understates by an unknown amount.
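
A rough sketch of the “actions followed” check, to make the definition concrete. The event shape here (a dict with `ts` in epoch seconds, `tool`, and `command`) is an assumption made for illustration; the real transcripts nest these fields differently, and the function names are hypothetical.

```python
import re
from typing import Iterable, Mapping, Optional

ACTION_TOOLS = {"Edit", "Write"}
# Matches `git` or `gh` as a command word, e.g. "git commit" or "cd x && gh pr list".
GIT_OR_GH = re.compile(r"(^|[\s;&|])(git|gh)\s")


def is_action(event: Mapping) -> bool:
    """True for Edit, Write, or a Bash call that invokes git or gh."""
    tool = event.get("tool")
    if tool in ACTION_TOOLS:
        return True
    if tool == "Bash":
        # Trailing space lets a bare "git" at end of string still match.
        return bool(GIT_OR_GH.search(event.get("command", "") + " "))
    return False


def actions_followed(
    events: Iterable[Mapping],
    deploy_ts: float,
    window_end: Optional[float] = None,
) -> bool:
    """True if at least one action occurs after the deploy and inside the window."""
    for event in events:
        ts = event["ts"]
        if ts <= deploy_ts:
            continue
        if window_end is not None and ts >= window_end:
            continue
        if is_action(event):
            return True
    return False
```

Note what this does not look at: whether the edit was kept, whether the git command succeeded, or whether anything merged. That is the same limit the caveat above describes.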

4. Finding #2: an adversarial subagent found six PII-bypass cases in a redaction module meant to prevent them

[falsifier-verified] The most concrete demonstration in this 71-invocation sample came on 2026-05-06. The prompt:

“Read through issue 210 and use a subagent team to plan, build, and iteratively test using adversarial reviewers.”

The work: a pilot feedback-and-log reporting feature on a client project (name and GitHub org redacted for client privacy; PR numbers and commit hashes preserved against the falsification trail). End-user feedback and crash logs flow from a desktop client up to a Cloud Run server. The redaction module at packages/feedback-redaction/ sat between the local capture and the network call. Its specific job was to scrub secrets, PII, and tokens before anything left the device.

Fourteen subagents ran in waves across the implementation. After the redaction module was wired up, an adversarial security reviewer was dispatched with instructions to break it:

“You are an adversarial security reviewer. Working dir: c:\Users\Bryce\Projects\<project>. Review the redaction module just built for the project’s issue #210. Your job is to FIND PROBLEMS, not validate the work. Be skeptical and specific.”

The reviewer numbered its findings F1 through F12 in rough severity order. The bottom line from its first-wave report:

“Six confirmed bypasses ship raw secrets / PII (F1, F2, F3, F4, F5, F6) on inputs that will appear in real Cloud Run logs and feedback payloads. F7 is a contract bug that could cause secondary leaks at the caller. F8–F12 are realistic edge cases that should be addressed before pilot. The test suite has structural gaps that allowed all of the above to land in main.

Recommend: do not ship until F1–F6 + F12 are fixed and tests added.”

The core result: the redaction reviewer’s findings drove 12 confirmed bypass fixes by merge. The first-wave report counted six confirmed bypasses (F1–F6) among twelve numbered findings, with F7 a contract bug and F8–F12 edge cases; subsequent adversarial passes brought the redaction reviewer’s confirmed-bypass count to 12. PR #211’s description records the final count: 30+ concrete issues across three parallel adversarial reviewers (redaction, cloud + main, dialog UX + a11y), with the redaction reviewer specifically credited with “12 confirmed bypasses fixed.”

PR #211’s test plan reports 176 feedback-related tests passing at merge: 86 redaction-suite tests, 27 cloud-route tests, 10 main-process tests, 17 portal-dialog tests, 5 widget-dialog tests, 4 entry-point tests, 27 regression-coverage tests.

A separate fix subagent landed the corrections. The deferred contract issue (a sha256 field in the feedback payload that was never server-verified) got its own dedicated cleanup subagent that touched 10 files.

The feature landed as PR #211 (feat: pilot feedback + log reporting (#210), merged 2026-05-06T13:58Z). A same-day follow-up commit (19ecea4 fix(feedback): reject non-ISO timestamps at validation boundary (#210)) closed a validation gap that surfaced after merge. The feature also went through a same-day revert (commit c172d74) and restore (PR #217, merged 2026-05-06T14:14:10Z); the redaction module work survived intact across that churn and is what ships today.

The adversarial reviewer wasn’t asked to validate or suggest. It was told to break the just-shipped module, and produced code-line-specific findings that a follow-up agent could act on.

5. Finding #3: what the falsification trail confirmed and what it rejected

[falsifier-verified] Three citations survive re-verification cleanly. One does not.

Held up:

  • arXiv:2212.08459: an adversarial reviewer caught a hallucinated author attribution. The citation named “Egger & Yu” as the paper’s authors; the reviewer was asked to open the URL and verify, and the names didn’t match the arXiv page. The actual paper is “Experiments on Generalizability of BERTopic on Multi-Domain Short Text” by Muriël de Groot, Mohammad Aliannejadi, Marcel R. Haas (verified at https://arxiv.org/abs/2212.08459 on 2026-05-22). One agent caught another agent’s hallucination on a check the original agent could have run but didn’t.
  • 3 commits in the same project repo (6af2a8b, 202ab39, 24f1ded). What’s verified: all three commits exist with the cited messages. What’s not verified: the causal claim “specific finding → specific commit line.” That step would require an LLM-classification pass not run for this study.

Did not hold up: a subagent in this study reported a “path-traversal risk in detectCorrectionCandidates.ts, fixed in chat-arch PR #52, 2026-05-08.” The finding was fabricated wholesale: real file name, real PR number, plausible-sounding vulnerability, none of it true.

The verification chain: at verification time (2026-05-22), PR #52 was a workshop-bundle feature consolidation in OPEN state, not a path-traversal fix; it was closed without merging on 2026-05-23. The actual 2026-05-08 adversarial-review fix is commit ac86e49 (fix(corrections-loop): address adversarial review findings (P0+P1)), and its commit body documents ten P0/P1 fixes (concurrency, atomic ledger writes, UI copy), but no path traversal. The cited file exists in chat-arch but isn’t touched by that commit. The catch required opening the actual PR rather than trusting the subagent’s reported finding; the scorecard row 3 verdict (REJECTED) reflects what was found.
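
One link in that chain, checking that a cited commit exists and carries the cited message, can be sketched as below. This is an illustration rather than the study's tooling: the function names and verdict strings are assumptions, and the `lookup` parameter exists only so the check can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Optional


def git_subject(repo: str, sha: str) -> Optional[str]:
    """Return the commit's subject line, or None if git cannot resolve the hash."""
    try:
        result = subprocess.run(
            ["git", "-C", repo, "log", "-1", "--format=%s", sha],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed or not on PATH
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def verify_citation(
    repo: str,
    sha: str,
    cited_subject: str,
    lookup: Callable[[str, str], Optional[str]] = git_subject,
) -> str:
    """Compare the cited subject line against what the repository actually holds."""
    actual = lookup(repo, sha)
    if actual is None:
        return "REJECTED: commit not found"
    if actual != cited_subject:
        return f"REJECTED: actual subject is {actual!r}"
    return "CONFIRMED"
```

A matching subject line only confirms that the commit exists as cited. It says nothing about whether the commit does what a subagent claimed, which is why the path-traversal claim also needed the PR and the commit body opened and read.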

[smoothed confidence 0.60] “Pattern reaches shipped code” stays below the 0.75 promote threshold at n=3 verified commits with prior=2. Honest reporting requires keeping it there until more commits with self-attribution are catalogued.

6. Finding #4: convergence stats and what “iteration” means here

[deterministic] Across 71 invocations, median = 2 iterations, mean = 2.41, range 0 to 11. Restricted to the 65 sessions with subagents observed in window, median = 2, mean = 2.63. The mean of 2.41 includes 6 zero-iteration cases (deploys where no subagent files landed in the parent invocation’s window).

“Iteration” here means a cluster of subagent-file first_ts timestamps, with clusters separated from each other by gaps of more than 2 minutes, inside the parent invocation’s time window (capped at the next deploy on the same parent file). A long session where Claude continuously fires new subagents on the same review counts as multi-round even if the user never typed “iterate again.”
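
One plausible reading of that definition, sketched in code: sort the subagent start times inside the window and start a new iteration whenever the gap to the previous start exceeds 2 minutes. The function name, the epoch-second timestamps, and the gap-based reading itself are assumptions; the study's aggregator may differ in detail.

```python
from typing import Iterable, Optional

GAP_SECONDS = 120  # "more than 2 minutes apart"


def count_iterations(
    first_ts: Iterable[float],
    window_start: float,
    window_end: Optional[float] = None,
    gap: float = GAP_SECONDS,
) -> int:
    """Count clusters of subagent start times separated by more than `gap` seconds.

    `window_end` stands in for the cap at the next deploy on the same parent file;
    None means the window runs to the end of the session.
    """
    in_window = sorted(
        ts
        for ts in first_ts
        if ts >= window_start and (window_end is None or ts < window_end)
    )
    if not in_window:
        return 0  # a zero-iteration deploy: no subagent files landed in the window
    iterations = 1
    for previous, current in zip(in_window, in_window[1:]):
        if current - previous > gap:
            iterations += 1
    return iterations
```

Under this reading, subagents launched in one parallel wave collapse into a single iteration, while a new wave started several minutes later counts as another, regardless of who triggered it.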

Limitation: that definition is not the same as user-typed re-dispatch rounds. Distinguishing “Claude-initiated continuation” from “user-initiated iteration” would require an LLM-classification pass not run for this study.

7. Finding #5: paired vs. solo lens is not significant at this n

The paired (optimistic + adversarial) phrasing first appears 2026-05-13. The implied “better outcomes” claim doesn’t survive a significance test:

| | n | mean iterations |
|---|---|---|
| Paired deploys | 9 | 3.44 |
| Solo deploys | 62 | 2.26 |
| Observed Δ | | +1.19 |
| Permutation p (10,000 perms) | | 0.169 |

Note on the Δ row: the unrounded means yield 1.19; the rounded values shown above subtract to 1.18.

The permutation p-value is the gating statistic. As a sanity check, Welch’s two-sample t = 1.23 with df = 9.73 is also well short of conventional significance at n_paired = 9. The observed direction (paired correlates with more iterations) doesn’t clear p ≤ 0.05, so this post makes no ranking claim for paired vs. solo. It might be a real effect under larger n; the current sample can’t say.
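
For readers who want to rerun this kind of test on their own iteration counts, here is a minimal sketch of a difference-of-means permutation test plus the Welch statistic. Several choices are assumptions not stated in this post: the test is two-sided, the p-value uses the common (hits + 1) / (perms + 1) correction, and the function names are hypothetical. No study data is embedded.

```python
import random
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def permutation_p(
    a: Sequence[float], b: Sequence[float], n_perms: int = 10_000, seed: int = 0
) -> float:
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # reassign group labels at random
        left, right = pooled[:n_a], pooled[n_a:]
        delta = abs(sum(left) / len(left) - sum(right) / len(right))
        if delta >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perms + 1)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Needs at least two observations per group and nonzero pooled variance.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df
```

The permutation test makes no normality assumption, which suits small skewed counts like iterations per deploy; Welch's t is kept only as a cross-check.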

8. What this data does NOT support

Several claims that commonly attach to adversarial-review patterns are not supported by this dataset:

  • Adversarial framing outperforms non-adversarial framing (no A/B run; tested phrasing-space is uncontrolled).
  • Paired outperforms solo (p=0.169, n_paired=9).
  • 69% confirmed-finding rate. Confirming it would require an LLM-classification pass over 65 captured subagent final messages + 395 follow-up user turns; not run for this study.
  • 5–15% false-positive rate (same dependency).
  • Path-traversal catch in PR #52 (rejected; see §5).
  • Action ≠ ship: the 0.89 figure tracks in-session Edit/Write/git tool_uses, not merges or production deploys (caveat called out in §3).

9. Next experiments to close the gaps in §8

Each non-claim in §8 has a specific experimental design that would convert it from “not supported” to a real claim:

  • Adversarial vs non-adversarial framing. A/B prompt-template test on a controlled set of similar review tasks: same code under review, only the dispatcher prompt varies. Measure delta in finding-confirmation rate and follow-up commit rate. n ≥ 30 per arm to detect a modest effect.
  • Paired vs solo lens. Extend the sample. Reaching p ≤ 0.05 at the current effect size (Δ = 1.19 iterations) needs roughly 30 paired deploys, or a deliberate A/B on tasks where either lens fits.
  • 69% confirmed-finding rate and 5-15% false-positive rate. Both depend on classifying subagent final messages and follow-up user turns as confirm / reject / ambiguous. The 5-tuples for all 71 invocations already capture those messages; the missing step is an LLM-classification pass against a manually-labeled calibration set of ~20 invocations.
  • Action ≠ ship. Trace each invocation’s Edit / Write / git tool_uses to downstream artifacts: which commits landed on main, which PRs merged. Substrate-level work: extract commit SHAs from tool_use events, run gh pr list and git log against each repo. Replaces “deploys correlate with in-session actions” with “deploys correlate with shipped work.”
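
The last experiment above could start from something like the sketch below: pull short SHAs out of captured `git commit` output, then ask git whether each one is an ancestor of the main branch. This is a starting point under assumptions, not existing tooling. It assumes the tool_use result text contains git's usual `[branch sha] subject` summary line, and the function names and the default branch name `main` are hypothetical.

```python
import re
import subprocess
from typing import List

# git prints e.g. "[main abc1234] subject" or "[main (root-commit) abc1234] subject"
COMMIT_LINE = re.compile(r"^\[[^\]]+? ([0-9a-f]{7,40})\] ", re.MULTILINE)


def extract_shas(tool_output: str) -> List[str]:
    """Return commit hashes found in `git commit` summary lines."""
    return COMMIT_LINE.findall(tool_output)


def landed_on(repo: str, sha: str, branch: str = "main") -> bool:
    """True if `sha` is an ancestor of `branch` in the repository at `repo`."""
    try:
        result = subprocess.run(
            ["git", "-C", repo, "merge-base", "--is-ancestor", sha, branch],
            capture_output=True,
            check=False,
        )
    except FileNotFoundError:
        return False  # git is not installed or not on PATH
    # Exit code 0 means ancestor; 1 means not; anything else is an error.
    return result.returncode == 0
```

Squash merges and rebases rewrite hashes, so an ancestor check will undercount shipped work; matching by PR number via `gh pr list` would be needed to cover those cases.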

10. Falsify your own retrospective before shipping it

The path-traversal claim in §5 is what happens when a retrospective inherits a subagent’s reported finding without verifying it. The fix is to verify during the writing, not as a separate audit pass after. A claim that doesn’t survive its own trail row doesn’t appear in the study.


Reproducibility: the underlying analysis runs as four small scripts (streaming corpus scanner, classifier + dedup, 5-tuple extractor, aggregator with permutation tests). The substrate they produce is a deterministic inventory CSV, one JSON per invocation with cell-level provenance tags, an aggregate metrics file, and a falsification trail listing every cited claim with its verification source. The substrate is written in shapes that chat-arch can ingest as seed data once its curator + falsifier substrate ships. Until then, the discipline lives in the skill checklist and the falsification trail itself.