Status: Active · Running daily across 65 local projects · Last verified: June 13, 2026
Read the design decisions behind Command for the full narrative: the eval gates built before the engine, the billing-driven architecture split, and the pre-registered proof loop.
The Problem
I run dozens of projects through Claude Code, and the leverage is supposed to be obvious: one operator, taught once, improves every project at once. The risk is just as obvious. A self-improving system’s native failure mode is self-delusion: proposing fixes that sound right and claiming credit that nothing verifies. I wanted the leverage without the lying: a system that watches the whole fleet, proposes the smallest config change that would prevent a recurring problem, and then has to prove the change moved a real number before it is allowed to call itself done.
The Approach
Command is a deterministic watcher plus an approval-gated orchestrator, built on Claude Code.
The watcher is plain code on a three-minute cycle, with no language model in the loop. It reads git status, GitHub Actions, service probes, and the session trail Claude Code and Cowork leave behind; ranks every project by its worst current signal; and keeps a per-project metric history about 24 hours deep. Deciding which project needs attention turned out to need no model at all.
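A minimal sketch of what "rank every project by its worst current signal" can look like in plain code. The signal names, the severity scale, and the `Project` type are hypothetical; the source does not publish Command's actual signals or weights.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical severity scale: higher means worse.
SEVERITY = {
    "ok": 0,
    "dirty_worktree": 1,
    "stale_session": 2,
    "ci_failing": 3,
    "service_down": 4,
}

HISTORY_DEPTH = timedelta(hours=24)  # "about 24 hours deep"


@dataclass
class Project:
    name: str
    signals: list[str]  # signals observed in the current cycle
    history: list[tuple[datetime, int]] = field(default_factory=list)

    def worst(self) -> int:
        """Severity of the worst current signal (0 when nothing is wrong)."""
        return max((SEVERITY[s] for s in self.signals), default=0)

    def record(self, now: datetime) -> None:
        """Append this cycle's reading and drop readings older than the depth."""
        self.history.append((now, self.worst()))
        cutoff = now - HISTORY_DEPTH
        self.history = [(t, v) for t, v in self.history if t >= cutoff]


def rank(projects: list[Project]) -> list[Project]:
    """Order projects worst-first; ties break by name so the output is stable."""
    return sorted(projects, key=lambda p: (-p.worst(), p.name))


fleet = [
    Project("blog", ["ok"]),
    Project("api", ["dirty_worktree", "service_down"]),
    Project("cli", ["ci_failing"]),
]
print([p.name for p in rank(fleet)])  # ['api', 'cli', 'blog']
```

Because the ordering is a pure function of the observed signals, the same inputs always produce the same ranking, which is the property the watcher relies on.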
The orchestrator is the interactive Claude Code session I open myself. It is the only part that reasons and the only part that executes anything with side effects. On that skeleton runs a signal engine: read the fleet’s friction, propose a fix to a project’s Claude Code configuration as a reviewable pull request, and measure whether it worked.
Three kinds of durable state sit between the halves:
- Objectives: outcomes I hand it, each with an autonomy dial and living state that survives across sessions.
- Signals: recurring friction observed bottom-up, walking new → corroborated → proposed → applied → measuring → proven → retired, with “didn’t move” and “inconclusive” as recorded dead ends.
- Directives: standing rules rolled out top-down, walking candidate → piloting → proven → fleet-wide. The only edge into fleet-wide is proven → fleet-wide.
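The two lifecycles can be sketched as explicit transition tables. The state names come from the text; the table layout, the `advance` helper, and the choice to branch both dead ends off "measuring" are assumptions made for illustration.

```python
# Allowed transitions, transcribed from the lifecycles described above.
# Assumption: the two dead ends branch off "measuring", since a verdict
# needs a measurement. The source does not say where they attach.
SIGNAL_EDGES = {
    "new": {"corroborated"},
    "corroborated": {"proposed"},
    "proposed": {"applied"},
    "applied": {"measuring"},
    "measuring": {"proven", "didn't move", "inconclusive"},
    "proven": {"retired"},
    "didn't move": set(),
    "inconclusive": set(),
    "retired": set(),
}

DIRECTIVE_EDGES = {
    "candidate": {"piloting"},
    "piloting": {"proven"},
    "proven": {"fleet-wide"},  # the only edge into fleet-wide
    "fleet-wide": set(),
}


class IllegalTransition(ValueError):
    """Raised when a state change is not listed in the transition table."""


def advance(edges: dict[str, set[str]], current: str, target: str) -> str:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in edges.get(current, set()):
        raise IllegalTransition(f"{current!r} -> {target!r} is not an allowed edge")
    return target


state = advance(DIRECTIVE_EDGES, "proven", "fleet-wide")  # allowed
# advance(DIRECTIVE_EDGES, "piloting", "fleet-wide") would raise IllegalTransition
```

Keeping the edges in data rather than in scattered conditionals makes the "only proven reaches fleet-wide" rule something a test can assert directly.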
Key Design Decisions
- No model in the watching half. A ranking I can read the code for never has an off day, and the continuous loop stays free of metered API usage. The reasoning lives only in the interactive session.
- Author per-repo config, don’t load it. A target repo’s CLAUDE.md and .claude/ are deliberately not loaded (settingSources: []), so an untrusted repo can’t inject instructions into the agent working on it. Command stopped consuming per-repo configuration and started authoring it. The config surface became the product.
- No engine until the evals passed. Five pre-registered go/no-go gates had to pass before any engine code existed. The hard one, a backtest, failed four runs straight and exposed a broken metric (an unreachable ceiling), not a broken engine. It was caught only because the bar was committed before the data was seen.
- Every grader proves its error rate on noise first. No measurement path judges a real fix until it has demonstrated its false-positive rate on synthetic no-effect data. Naive consecutive-poll comparisons false-prove 37–42% of the time; the shipped recipe, readings spaced hours apart, holds false-proven at or under 1.5%.
- Narrow apply, earned reach. A fix is a pull request, only into repos I own, touching only .claude/ and CLAUDE.md, capped at 200 changed lines, never auto-merged. Nothing reaches all 65 projects on a hunch.
- Four hard walls in deterministic code. Spend money, message an unknown person, publish or go public, or cause irreversible data loss: each requires a separate explicit confirmation, enforced by a gate function with tests behind it, not by the model’s judgment.
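The last two decisions describe checks that live in plain code rather than in the model's judgment. A minimal sketch of both follows. The function names, the `Wall` enum, and the reading of "CLAUDE.md" as the root-level file only are assumptions; only the 200-line cap, the two allowed paths, and the four wall categories come from the text.

```python
from enum import Enum
from pathlib import PurePosixPath

MAX_CHANGED_LINES = 200  # cap stated in the text


def path_in_scope(path: str) -> bool:
    """True only for a root-level CLAUDE.md or a file under .claude/."""
    parts = PurePosixPath(path).parts
    if ".." in parts:  # refuse paths that climb out of .claude/
        return False
    return parts == ("CLAUDE.md",) or (len(parts) > 1 and parts[0] == ".claude")


def fix_in_scope(changed: dict[str, int]) -> bool:
    """`changed` maps each touched path to its changed-line count."""
    paths_ok = all(path_in_scope(p) for p in changed)
    return paths_ok and sum(changed.values()) <= MAX_CHANGED_LINES


class Wall(Enum):
    SPEND_MONEY = "spend money"
    MESSAGE_UNKNOWN_PERSON = "message an unknown person"
    PUBLISH = "publish or go public"
    IRREVERSIBLE_DATA_LOSS = "cause irreversible data loss"


def gate(walls_touched: set[Wall], confirmed: set[Wall]) -> bool:
    """Allow an action only if every wall it touches was confirmed on its own."""
    return walls_touched <= confirmed


print(fix_in_scope({"CLAUDE.md": 40, ".claude/settings.json": 160}))  # True
print(fix_in_scope({"src/app.py": 3}))                                # False
print(gate({Wall.SPEND_MONEY, Wall.PUBLISH}, {Wall.SPEND_MONEY}))     # False
```

Both checks are pure functions with no model call, so they can be covered by ordinary unit tests.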
What’s Implemented
- A free, model-free watcher ranking 65 projects on a three-minute cycle, feeding a per-project metric history.
- The full signal and directive lifecycles, with pre-registration: which metric should move, in which direction, by how much, within what window, all committed to git before the fix runs, one attempt per registration.
- Autocorrelation-aware measurement: readings spaced at least two hours apart, certified to hold false-proven at or under 1.5% with high detection across its range; metrics outside that envelope require a pre-registered absolute floor instead.
- A standing instrument-validity audit (ceiling, denominator, operating-point, run-variance) that runs before any gate verdict is trusted.
- The four walls as deterministic gates with tests, wired into the tool-call path.
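To make the pre-registration and spaced-reading items concrete, here is a rough sketch of a registration record and a verdict function. It uses a plain before/after mean comparison, which is a stand-in and not the certified recipe described above; it makes no claim about reproducing the 1.5% false-proven bound. The `Registration` fields mirror the text (metric, direction, amount, window), while `min_per_side` and every function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

MIN_SPACING = timedelta(hours=2)  # "readings spaced at least two hours apart"


@dataclass(frozen=True)  # frozen: the bar cannot be edited after it is committed
class Registration:
    metric: str
    direction: int      # +1 if the metric should rise, -1 if it should fall
    min_delta: float    # by how much
    window: timedelta   # within what window after the fix is applied


def spaced(readings, min_spacing=MIN_SPACING):
    """Keep readings at least `min_spacing` apart, dropping the ones in between."""
    kept = []
    for t, v in sorted(readings):
        if not kept or t - kept[-1][0] >= min_spacing:
            kept.append((t, v))
    return kept


def verdict(reg, applied_at, readings, min_per_side=3):
    """Return 'proven', "didn't move", or 'inconclusive' for one registration."""
    usable = spaced(readings)
    before = [v for t, v in usable if t < applied_at]
    after = [v for t, v in usable if applied_at <= t <= applied_at + reg.window]
    if len(before) < min_per_side or len(after) < min_per_side:
        return "inconclusive"  # too few independent readings to judge
    moved = reg.direction * (mean(after) - mean(before))
    return "proven" if moved >= reg.min_delta else "didn't move"


applied = datetime(2026, 6, 1, 12, 0)
# Hourly readings: 10.0 before the fix, 4.0 after it (made-up values).
readings = [(applied + timedelta(hours=h), 10.0 if h < 0 else 4.0)
            for h in range(-12, 13)]
reg = Registration("retries_per_session", direction=-1, min_delta=5.0,
                   window=timedelta(hours=24))
print(verdict(reg, applied, readings))  # proven
```

The thinning step is the part that matters for autocorrelation: hourly readings collapse to one every two hours before any comparison is made, so neighbouring polls cannot vouch for each other.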
What This Demonstrates
- Systems design under adversarial self-measurement: building a self-improving loop whose first design goal is to not fool itself.
- Eval rigor and pre-registration: success criteria committed before the data, honest “didn’t move” verdicts recorded rather than hidden, broken metrics diagnosed before broken work is blamed.
- Trust engineering and blast-radius control: narrow, reversible, reviewed changes; earned reach; hard walls in code.
- Cost-aware architecture: splitting the continuous, deterministic half from the metered, reasoning half in response to a real billing constraint.
The architecture and the proof discipline are in place and running daily. The measured verdicts (does a graded fix actually earn its reach, do proven fixes accumulate across the fleet faster than I could route them by hand) are what the next chapter is about.