Command | Bryce Watson

Status: Active · Running daily across 65 local projects · Last verified: June 19, 2026

Read the design decisions behind Command for the full narrative: the eval gates built before the engine, the billing-driven architecture split, and the pre-registered proof loop.

The Problem

I run dozens of projects through Claude Code, and the leverage is supposed to be obvious: one operator, taught once, improves every project at once. The risk is just as obvious. A self-improving system’s native failure mode is self-delusion, proposing fixes that sound right and claiming credit nothing verifies. I wanted the leverage without the lying: a system that watches the whole fleet, proposes the smallest config change that would prevent a recurring problem, and then has to prove the change moved a real number before it is allowed to call itself done.

The Approach

Command is a deterministic watcher plus an approval-gated orchestrator, built on Claude Code.

The watcher is plain code on a three-minute cycle. It reads git status, GitHub Actions, service probes, and the session trail Claude Code and Cowork leave behind; ranks every project by its worst current signal; and keeps a per-project metric history about 24 hours deep. No language model is in that loop, with one exception: a single bounded Haiku call writes each project’s status prose, and it has no role in the ranking. Deciding which project needs attention turned out to need no model at all.

The orchestrator is the interactive Claude Code session I open myself. It is the only part that reasons and the only part that executes gated actions. On that skeleton runs a signal engine: read the fleet’s friction, apply a fix to Command’s own configuration or file a requirements issue for the project that owns the code, and measure whether it moved a real number.

Three kinds of durable state sit between the halves:

Objectives: outcomes I hand it, each with an autonomy dial and living state that survives across sessions.
Signals: recurring friction observed bottom-up, walking new → corroborated → proposed → applied or delegated → measuring → proven → retired (a fix for one of my own repos is applied; one for another project is delegated to it as an issue), with didn't move, inconclusive, and remediation-resistant (after three failed attempts) as recorded dead ends.
Directives: standing rules rolled out top-down, walking candidate → piloting → proven → fleet-wide. The only edge into fleet-wide is proven → fleet-wide.

Key Design Decisions

No model in the watching half. A ranking I can read the code for never has an off day, and deciding what needs attention costs no metered API calls. The reasoning lives only in the interactive session.
Author per-repo config, don’t load it. A target repo’s CLAUDE.md and .claude/ are deliberately not loaded (settingSources: []), so an untrusted repo can’t inject instructions into the agent working on it. Command stopped consuming per-repo configuration and started authoring it. The config surface became the product.
No engine until the evals passed. Five pre-registered go/no-go gates had to pass before any engine code existed. The hard one, a backtest, failed four runs straight and exposed a broken metric (an unreachable ceiling), not a broken engine. It was caught only because the bar was committed before the data was seen.
Every grader proves its error rate on noise first. No measurement path judges a real fix until it has demonstrated its false-positive rate on synthetic no-effect data. Naive consecutive-poll comparisons false-prove 37–42% of the time; the shipped recipe, readings spaced hours apart, holds false-proven at or under 1.5%.
Narrow apply, earned reach. Command changes only its own configuration directly; for any other project it files a requirements issue and leaves the implementation to the session that owns that repo’s context. Reach is earned, not assumed: a standing rule only goes fleet-wide by walking candidate → piloting → proven. Nothing reaches all 65 projects on a hunch.
Four hard walls in deterministic code. Spend money, message an unknown person, publish or go public, or cause irreversible data loss: each requires a separate explicit confirmation, enforced by a gate function with tests behind it, not by the model’s judgment.

What’s Implemented

A free, model-free watcher ranking 65 projects on a three-minute cycle, feeding a per-project metric history.
The full signal and directive lifecycles, with pre-registration: which metric should move, in which direction, by how much, within what window, all committed to git before the fix runs, one attempt per registration.
Autocorrelation-aware measurement: readings spaced at least two hours apart, certified to hold false-proven at or under 1.5% with high detection across its range; metrics outside that envelope require a pre-registered absolute floor instead.
A standing instrument-validity audit (ceiling, denominator, operating-point, run-variance) that runs before any gate verdict is trusted.
The four walls as deterministic gates with tests, wired into the tool-call path.

What This Demonstrates

Systems design under adversarial self-measurement: building a self-improving loop whose first design goal is to not fool itself.
Eval rigor and pre-registration: success criteria committed before the data, honest “didn’t move” verdicts recorded rather than hidden, broken metrics diagnosed before broken work is blamed.
Trust engineering and blast-radius control: narrow, reversible, reviewed changes; earned reach; hard walls in code.
Cost-aware architecture: splitting the continuous, deterministic half from the metered, reasoning half in response to a real billing constraint.

The architecture and the proof discipline are in place and running daily. The measured verdicts (does a graded fix actually earn its reach, do proven fixes accumulate across the fleet faster than I could route them by hand) are what the next chapter is about.