ShopForge

Status: Active · Fully operational production pipeline · Last verified: May 11, 2026

Evolved from ShopSmith OS, which proved the domain. ShopForge rebuilt it as a self-improving agent.

Read the full case study for the narrative behind this project, including session logs, research attribution, and early production results.

The Problem

Running a digital art print shop on Etsy means repeating the same high-friction pipeline hundreds of times: generate art, quality-check every image, write SEO copy, composite listing photos, verify Etsy compliance, publish a draft, and schedule Pinterest pins. Each step has subtle failure modes: AI-generated signatures hiding in corners, green fringe artifacts on composited frames, listings that accidentally say “handmade” instead of “AI-generated.” Doing this manually doesn’t scale. Scripting it procedurally doesn’t learn.

I wanted a system that could run the full pipeline autonomously, remember what works, and get measurably better at it over time, without fine-tuning a model.

The Approach

ShopForge is an agentic AI system built on Claude that operates a real Etsy shop called ChateaucoreWalls (French château aesthetic digital art prints). It’s not a chatbot wrapper or a prompt chain. It’s a self-improving agent with persistent memory, multi-tier quality assurance, and skill files that evolve based on production evidence.

The architecture enforces a hard separation between an intelligence layer (Claude + parallel sub-agents for reasoning, QA, and copywriting) and a data/execution layer (Node.js + SQLite for deterministic operations, image processing, and state management). The LLM reasons; the server executes. Neither crosses the boundary.

Three-Tier Memory System

The memory system is the most distinctive piece of engineering. It is inspired by the A-MEM research paper (arXiv:2502.12110, Feb 2025) and the “Memory in the Age of AI Agents” survey (arXiv:2512.13564, Dec 2025), and it has three tiers:

  1. Episodes: Raw event records (generation runs, QA verdicts, publishing outcomes). Recorded selectively: errors always, successes only when they carry enrichment data (scores, quality dimensions). This prevents the store from bloating with low-signal noise.

  2. Semantic Rules: Distilled insights with Bayesian confidence scoring: confidence = supporting / (supporting + contradicting + 2). The constant 2 in the denominator acts as a prior, so a single observation can never produce high confidence (1 supporting episode yields 1/3 ≈ 0.33); it takes 6 supporting episodes with zero contradictions to reach 0.75 (6/8). Rules are scoped across 32 domains (prompt engineering, compositor, Etsy compliance, seasonal patterns, etc.).

  3. Templates: Compiled, human-reviewed configurations that bundle proven prompts, generation parameters, listing copy patterns, and expected QA scores. Templates track a running-mean QA score (Knuth’s online mean algorithm) and auto-retire if actual scores fall consistently below expectations.
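
The confidence formula and the running-mean update described above can be sketched in a few lines. This is a minimal illustration, not ShopForge's actual code: the type and function names are hypothetical, and the retirement policy (a minimum sample count and a margin below the expected score) is an assumption, since the text only says templates retire when scores fall "consistently below expectations."

```typescript
// Hypothetical names throughout; only the confidence formula comes from the text.
interface RuleEvidence {
  supporting: number;
  contradicting: number;
}

// The constant 2 damps confidence while evidence is thin.
const PRIOR = 2;

function ruleConfidence({ supporting, contradicting }: RuleEvidence): number {
  return supporting / (supporting + contradicting + PRIOR);
}

interface RunningMean {
  count: number;
  mean: number;
}

// Online mean update: newMean = mean + (x - mean) / n.
// No history of individual scores needs to be stored.
function updateRunningMean(state: RunningMean, score: number): RunningMean {
  const count = state.count + 1;
  return { count, mean: state.mean + (score - state.mean) / count };
}

// ASSUMED policy: retire only after enough samples, and only when the
// running mean sits below the expected score by more than a margin.
function shouldRetire(
  state: RunningMean,
  expected: number,
  minSamples = 5,
  margin = 0.5,
): boolean {
  return state.count >= minSamples && state.mean < expected - margin;
}
```

With these definitions, one supporting episode gives 1/3, six give 6/8 = 0.75, and a rule with no evidence at all sits at 0.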

A single retrieval function (getRelevantContext) queries all three tiers and returns a readiness assessment: template_ready, rules_available, episodes_only, or no_data, so the agent knows exactly how much institutional knowledge exists for any given task.
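
The readiness assessment can be pictured as a simple precedence check over what each tier returned. The sketch below is illustrative only: the real getRelevantContext queries the database, whereas this hypothetical assessReadiness helper just takes counts, and the "highest tier wins" ordering is an assumption inferred from the four labels.

```typescript
type Readiness = "template_ready" | "rules_available" | "episodes_only" | "no_data";

// Hypothetical shape: how many matching records each tier returned for a task.
interface TierCounts {
  templates: number;
  rules: number;
  episodes: number;
}

// ASSUMPTION: the most distilled tier with any match determines readiness.
function assessReadiness(counts: TierCounts): Readiness {
  if (counts.templates > 0) return "template_ready";
  if (counts.rules > 0) return "rules_available";
  if (counts.episodes > 0) return "episodes_only";
  return "no_data";
}
```

For example, a task with matching rules and episodes but no template would report rules_available, which tells the agent to plan from rules rather than reuse a compiled configuration.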

Self-Improvement Loops

Three nested loops, drawing from the Self-Evolving Agents survey and EvoAgentX framework (arXiv:2508.07407) and the Reflexion framework (arXiv:2303.11366, NeurIPS 2023):

  • Inner loop (per-task). Reflexion-style retry with QA feedback. If >40% of generated images fail QA, the planner adjusts prompts and regenerates within the same run.
  • Middle loop (adaptive cadence). Episodes are clustered by (category, style, theme) and analyzed by a consolidation sub-agent that proposes new semantic rules. The cadence adapts to maturity: every product early on, every 5 products once the shop is established. Evidence counts are clamped to prevent the LLM from hallucinating inflated confidence.
  • Outer loop (strategic). A metacognition skill, structured around the ICML 2025 position paper on intrinsic metacognitive learning (arXiv:2506.05109), generates improvement curves, identifies strengths/weaknesses by style, and proposes strategic pivots. The user makes the final call.
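
Two of the mechanisms above are easy to make concrete: the inner loop's 40% failure trigger and the middle loop's evidence clamping. The sketch below uses hypothetical names (shouldRegenerate, clampEvidence, ProposedRule) and assumes that clamping means capping the LLM's claimed counts at the number of episodes actually present in the cluster; the source does not spell out the exact clamping rule.

```typescript
// Inner loop: regenerate when strictly more than 40% of images fail QA.
const FAILURE_THRESHOLD = 0.4;

function shouldRegenerate(qaPassed: boolean[]): boolean {
  if (qaPassed.length === 0) return false;
  const failures = qaPassed.filter((passed) => !passed).length;
  return failures / qaPassed.length > FAILURE_THRESHOLD;
}

// Middle loop: a rule proposed by the consolidation sub-agent (hypothetical shape).
interface ProposedRule {
  text: string;
  supporting: number;
  contradicting: number;
}

// ASSUMED clamping rule: claimed counts can never exceed the number of
// episodes in the cluster the rule was derived from.
function clampEvidence(rule: ProposedRule, clusterSize: number): ProposedRule {
  const supporting = Math.max(0, Math.min(Math.floor(rule.supporting), clusterSize));
  const contradicting = Math.max(
    0,
    Math.min(Math.floor(rule.contradicting), clusterSize - supporting),
  );
  return { ...rule, supporting, contradicting };
}
```

Under this assumption, a sub-agent that claims 40 supporting episodes for a cluster of 5 is recorded with 5, so the resulting confidence reflects real evidence rather than the model's estimate.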

A fourth mechanism, skill evolution, can promote high-confidence rules directly into the skill markdown files, permanently improving the agent’s behavior. Graduation is a deliberate step rather than a side effect of every run: after consolidation, the system scans for graduation candidates and proposes changes. Low-risk edits can auto-apply; structural changes require human approval. Every graduation is logged with a diff and can be rolled back from backup files.

Intent-Based Skills (Not Procedural Scripts)

20+ skills define purpose, quality rubrics, available tools, decision gates, and output contracts, not step-by-step procedures. Claude derives the implementation. A v1-to-v2 refactoring achieved a 74% line reduction (6,163 → 1,613 lines across 7 core skills) while preserving all quality-critical domain knowledge.

Skills encode hard-won operational lessons. Examples:

  • The corner crop signature check (800×800 crop of each corner, every image) catches AI watermarks that full-image review misses ~30% of the time
  • The multi-pass defringe pipeline eliminates green-screen compositing artifacts using three simultaneous color detection methods (RGB ratio, HSV, relative green excess) because any single method misses edge cases
  • The hero photo diversity gate enforces round-robin rotation across 5 frame/room variations with a minimum-separation constraint, so the shop grid never looks templated
  • “Never auto-publish” appears as an unbreakable rule in 8 separate skill files; the redundancy is intentional
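
To illustrate why the defringe step combines three detectors, here is a per-pixel sketch that flags a pixel as green fringe if any one method fires. Every threshold below is an illustrative assumption, not a value from ShopForge, and combining the methods with a logical OR is one reading of "simultaneous"; the function names are hypothetical.

```typescript
interface Rgb {
  r: number; // each channel in 0..255
  g: number;
  b: number;
}

// Standard RGB -> HSV conversion: hue in degrees [0, 360), s and v in [0, 1].
function rgbToHsv({ r, g, b }: Rgb): { h: number; s: number; v: number } {
  const rn = r / 255;
  const gn = g / 255;
  const bn = b / 255;
  const max = Math.max(rn, gn, bn);
  const delta = max - Math.min(rn, gn, bn);
  let h = 0;
  if (delta > 0) {
    if (max === rn) h = 60 * (((gn - bn) / delta) % 6);
    else if (max === gn) h = 60 * ((bn - rn) / delta + 2);
    else h = 60 * ((rn - gn) / delta + 4);
  }
  if (h < 0) h += 360;
  return { h, s: max === 0 ? 0 : delta / max, v: max };
}

// Method 1 (assumed thresholds): green clearly dominates both other channels.
function greenByRatio(p: Rgb): boolean {
  return p.g > 60 && p.g > 1.4 * p.r && p.g > 1.4 * p.b;
}

// Method 2 (assumed thresholds): hue in the green band, saturated and bright enough.
function greenByHsv(p: Rgb): boolean {
  const { h, s, v } = rgbToHsv(p);
  return h >= 80 && h <= 160 && s > 0.35 && v > 0.2;
}

// Method 3 (assumed threshold): green excess relative to overall brightness.
function greenByExcess(p: Rgb): boolean {
  const sum = p.r + p.g + p.b;
  return sum > 0 && (p.g - Math.max(p.r, p.b)) / sum > 0.15;
}

// A pixel counts as fringe if any detector flags it.
function isGreenFringe(p: Rgb): boolean {
  return greenByRatio(p) || greenByHsv(p) || greenByExcess(p);
}
```

With these thresholds, a dark green pixel such as (20, 50, 20) slips past the ratio and HSV checks because it is too dim, but the relative-excess check still catches it. That is the kind of edge case a single detector would miss.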

Graduated Autonomy

Decision gates track their own approval history. The system stores approval rates and precedent counts per gate type, enabling skills to recommend auto-approval for decisions with strong track records while always requiring human approval for safety-critical gates (publishing, compliance). Autonomy is earned through demonstrated competence, not configured by default.
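
A gate's recommendation can be sketched as a pure function of its stored history. The names, the minimum precedent count, and the approval-rate cutoff below are all assumptions for illustration; the only details taken from the text are that approval rates and precedent counts are stored per gate type and that publishing and compliance gates always require a human.

```typescript
// Hypothetical record of how a gate type has been decided so far.
interface GateHistory {
  gateType: string;
  approvals: number;
  rejections: number;
}

type Recommendation = "require_human" | "recommend_auto_approve";

// Safety-critical gates never graduate, regardless of track record.
const SAFETY_CRITICAL = new Set(["publishing", "compliance"]);

// ASSUMED thresholds: at least 10 precedents and a 95% approval rate.
function recommendGate(
  history: GateHistory,
  minPrecedents = 10,
  minApprovalRate = 0.95,
): Recommendation {
  if (SAFETY_CRITICAL.has(history.gateType)) return "require_human";
  const total = history.approvals + history.rejections;
  if (total < minPrecedents) return "require_human";
  return history.approvals / total >= minApprovalRate
    ? "recommend_auto_approve"
    : "require_human";
}
```

The function only recommends; in keeping with the design described above, acting on the recommendation remains a separate decision.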

What’s Implemented

Full production pipeline:

  • AI image generation via Gemini API with learned prompt optimization
  • Multi-pass green-screen compositing (frame mockups with chroma-key art placement)
  • Multi-tier QA: visual inspection + corner crop analysis + persona-based buyer evaluation (3 buyer personas with full psychographic profiles) + Etsy compliance verification
  • SEO copywriting with brand voice consistency
  • Automated Etsy draft creation (never auto-publishes; human publishes manually)
  • Pinterest pin scheduling with configurable content mix targets

Infrastructure:

  • ~95 server functions callable via direct import (bypassing MCP for reliability)
  • SQLite in WAL mode with 14 tables (crash-safe, zero-config, survives session interruptions)
  • Atomic state writes (temp file + rename) throughout
  • A/B experiment framework with composite engagement scoring (sales weighted 5× over favorites, 15% signal threshold for winner declaration)
  • Live dashboard with Server-Sent Events
  • Session state continuity: any session can resume exactly where the last one stopped
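
The A/B scoring rule in the list above can be sketched as follows. The 5× sales weight and the 15% threshold come from the text; how the threshold is measured is an assumption (here, the gap between the two scores as a fraction of the higher score), and the type and function names are hypothetical.

```typescript
interface VariantStats {
  name: string;
  sales: number;
  favorites: number;
}

const SALES_WEIGHT = 5; // a sale counts 5x as much as a favorite
const SIGNAL_THRESHOLD = 0.15; // minimum relative gap before declaring a winner

function engagementScore(v: VariantStats): number {
  return SALES_WEIGHT * v.sales + v.favorites;
}

// Returns the winning variant's name, or null when the signal is too weak.
// ASSUMPTION: the gap is measured relative to the higher of the two scores.
function declareWinner(a: VariantStats, b: VariantStats): string | null {
  const scoreA = engagementScore(a);
  const scoreB = engagementScore(b);
  const high = Math.max(scoreA, scoreB);
  if (high === 0) return null; // no engagement on either variant yet
  const gap = Math.abs(scoreA - scoreB) / high;
  if (gap < SIGNAL_THRESHOLD) return null;
  return scoreA > scoreB ? a.name : b.name;
}
```

Returning null for a weak signal keeps an experiment open instead of crowning a variant on noise, which matters most in the low-traffic cold-start period.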

Tech stack: TypeScript/Node.js, SQLite (better-sqlite3), Google Gemini API, sharp (image processing), RealESRGAN (upscaling, with sharp fallback), Claude (intelligence layer)

Initial 10-Day Results (Feb–Mar 2026)

Snapshot from the first 10 days of operation on the ChateaucoreWalls Etsy shop. ShopForge has been running since late February 2026; the numbers below are the cold-start window, kept here for honesty about how the architecture’s Bayesian priors behave with zero history.

What’s real and working:

  • 17 published listings across 5 styles (watercolor, classical oil painting, botanical illustration, oil painting, fine art photography)
  • 68 episodes recorded across 8 types, feeding the memory system
  • 15 semantic rules with Bayesian confidence scores ranging from 0.00 (axioms awaiting evidence) to 0.82 (Pinterest queue ordering, learned from FK constraint failures)
  • Rules derived from real production failures, including a consolidation API bug that produced a rule about field naming conventions, and a size-claim verification rule after listing copy overstated print dimensions
  • Early traffic data: top performer at 25 views and 4 favorites in 2 days (Chateaucore Watercolor 20-Pack)
  • Average rule confidence climbing: image generation scope rose from 0.33–0.60 to 0.74 average; operational rules reached 0.99

What’s still in cold start:

  • Zero sales and zero revenue (the shop was ~10 days old at this snapshot)
  • Persona calibration has no predicted-vs-actual feedback loop yet (requires sales data)
  • No A/B experiments created yet
  • Improvement curves lack structured QA scores for trend generation (QA verdicts were recorded as PASS/FAIL, not numeric)
  • Gate enforcement logging hasn’t been exercised through the formal workflow yet

The architecture predicts a cold start period. With the +2 prior in the confidence formula, a rule needs 6 supporting episodes and zero contradictions to reach 0.75 confidence (6/8), so low scores in the first days are expected rather than a defect. The system is honest about what it doesn’t know yet.

What This Demonstrates

  • Agentic architecture design: A production system where an LLM operates as a reasoning engine within a structured execution framework, not as a chatbot
  • Self-improving systems without fine-tuning: Three-tier memory with Bayesian confidence, automated consolidation, and skill evolution that permanently improves agent behavior based on production evidence
  • Research-to-production translation: Concepts from 9 cited research papers and frameworks (A-MEM, Memory Survey, Reflexion, EvoAgentX, ICML 2025 metacognition, SICA, PersonaCite, Nakajima, Microsoft ISE) implemented as working software
  • Trust engineering: Graduated autonomy gates, evidence clamping to prevent hallucinated confidence, human gates on templates, 5-layer publish safety, mandatory AI disclosure verification
  • Domain knowledge encoding: Converting operational lessons (corner crop checks, defringe pipelines, hero diversity) into system constraints that prevent regression
  • Full-stack AI product development: Image generation through marketplace listing, with quality assurance, compliance, marketing, and analytics in a single coherent system