5 min read
Command

Status: Active · Running daily across 65 local projects · Last verified: June 13, 2026

Read the design decisions behind Command for the full narrative: the eval gates built before the engine, the billing-driven architecture split, and the pre-registered proof loop.

The Problem

I run dozens of projects through Claude Code, and the leverage is supposed to be obvious: one operator, taught once, improves every project at once. The risk is just as obvious. A self-improving system’s native failure mode is self-delusion, proposing fixes that sound right and claiming credit nothing verifies. I wanted the leverage without the lying: a system that watches the whole fleet, proposes the smallest config change that would prevent a recurring problem, and then has to prove the change moved a real number before it is allowed to call itself done.

The Approach

Command is a deterministic watcher plus an approval-gated orchestrator, built on Claude Code.

The watcher is plain code on a three-minute cycle, with no language model in the loop. It reads git status, GitHub Actions, service probes, and the session trail Claude Code and Cowork leave behind; ranks every project by its worst current signal; and keeps a per-project metric history about 24 hours deep. Deciding which project needs attention turned out to need no model at all.

The orchestrator is the interactive Claude Code session I open myself. It is the only part that reasons and the only part that executes anything with side effects. On that skeleton runs a signal engine: read the fleet’s friction, propose a fix to a project’s Claude Code configuration as a reviewable pull request, and measure whether it worked.

Three kinds of durable state sit between the halves:

  • Objectives: outcomes I hand it, each with an autonomy dial and living state that survives across sessions.
  • Signals: recurring friction observed bottom-up, walking new → corroborated → proposed → applied → measuring → proven → retired, with didn't move and inconclusive as recorded dead ends.
  • Directives: standing rules rolled out top-down, walking candidate → piloting → proven → fleet-wide. The only edge into fleet-wide is proven → fleet-wide.

Key Design Decisions

  • No model in the watching half. A ranking I can read the code for never has an off day, and the continuous loop stays free of metered API usage. The reasoning lives only in the interactive session.
  • Author per-repo config, don’t load it. A target repo’s CLAUDE.md and .claude/ are deliberately not loaded (settingSources: []), so an untrusted repo can’t inject instructions into the agent working on it. Command stopped consuming per-repo configuration and started authoring it. The config surface became the product.
  • No engine until the evals passed. Five pre-registered go/no-go gates had to pass before any engine code existed. The hard one, a backtest, failed four runs straight and exposed a broken metric (an unreachable ceiling), not a broken engine. It was caught only because the bar was committed before the data was seen.
  • Every grader proves its error rate on noise first. No measurement path judges a real fix until it has demonstrated its false-positive rate on synthetic no-effect data. Naive consecutive-poll comparisons false-prove 37–42% of the time; the shipped recipe, readings spaced hours apart, holds false-proven at or under 1.5%.
  • Narrow apply, earned reach. A fix is a pull request, only into repos I own, touching only .claude/ and CLAUDE.md, capped at 200 changed lines, never auto-merged. Nothing reaches all 65 projects on a hunch.
  • Four hard walls in deterministic code. Spend money, message an unknown person, publish or go public, or cause irreversible data loss: each requires a separate explicit confirmation, enforced by a gate function with tests behind it, not by the model’s judgment.

What’s Implemented

  • A free, model-free watcher ranking 65 projects on a three-minute cycle, feeding a per-project metric history.
  • The full signal and directive lifecycles, with pre-registration: which metric should move, in which direction, by how much, within what window, all committed to git before the fix runs, one attempt per registration.
  • Autocorrelation-aware measurement: readings spaced at least two hours apart, certified to hold false-proven at or under 1.5% with high detection across its range; metrics outside that envelope require a pre-registered absolute floor instead.
  • A standing instrument-validity audit (ceiling, denominator, operating-point, run-variance) that runs before any gate verdict is trusted.
  • The four walls as deterministic gates with tests, wired into the tool-call path.

What This Demonstrates

  • Systems design under adversarial self-measurement: building a self-improving loop whose first design goal is to not fool itself.
  • Eval rigor and pre-registration: success criteria committed before the data, honest “didn’t move” verdicts recorded rather than hidden, broken metrics diagnosed before broken work is blamed.
  • Trust engineering and blast-radius control: narrow, reversible, reviewed changes; earned reach; hard walls in code.
  • Cost-aware architecture: splitting the continuous, deterministic half from the metered, reasoning half in response to a real billing constraint.

The architecture and the proof discipline are in place and running daily. The measured verdicts (does a graded fix actually earn its reach, do proven fixes accumulate across the fleet faster than I could route them by hand) are what the next chapter is about.