Status: Active · Fully operational production pipeline · Last verified: Mar 8, 2026
Evolved from ShopSmith OS, which proved the domain. ShopForge rebuilt it as a self-improving agent.
Read the full case study for the narrative behind this project, including session logs, research attribution, and early production results.
The Problem
Running a digital art print shop on Etsy means repeating the same high-friction pipeline hundreds of times: generate art, quality-check every image, write SEO copy, composite listing photos, verify Etsy compliance, publish a draft, and schedule Pinterest pins. Each step has subtle failure modes: AI-generated signatures hiding in corners, green fringe artifacts on composited frames, listings that accidentally say “handmade” instead of “AI-generated.” Doing this manually doesn’t scale. Scripting it procedurally doesn’t learn.
I wanted a system that could run the full pipeline autonomously, remember what works, and get measurably better at it over time, without fine-tuning a model.
The Approach
ShopForge is an agentic AI system built on Claude that operates a real Etsy shop called ChateaucoreWalls (French château aesthetic digital art prints). It’s not a chatbot wrapper or a prompt chain. It’s a self-improving agent with persistent memory, multi-tier quality assurance, and skill files that evolve based on production evidence.
The architecture enforces a hard separation between an intelligence layer (Claude + parallel sub-agents for reasoning, QA, and copywriting) and a data/execution layer (Node.js + SQLite for deterministic operations, image processing, and state management). The LLM reasons; the server executes. Neither crosses the boundary.
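A minimal sketch of that boundary, with illustrative names rather than ShopForge's actual API: the reasoning layer only emits a structured request, and only the server executes and mutates state.

```typescript
// Hypothetical sketch of the intelligence/execution split. The LLM
// never touches SQLite or the filesystem directly; it selects a named
// server function, and the execution layer validates and runs it.
type ServerFn = (args: Record<string, unknown>) => unknown;

const serverFunctions: Record<string, ServerFn> = {
  // A deterministic execution-layer operation (illustrative stub).
  recordEpisode: (args) => ({ ok: true, type: args.type }),
};

// The reasoning layer returns a structured request, nothing more.
interface ToolRequest {
  fn: string;
  args: Record<string, unknown>;
}

function execute(req: ToolRequest): unknown {
  const fn = serverFunctions[req.fn];
  if (!fn) throw new Error(`Unknown server function: ${req.fn}`);
  return fn(req.args);
}
```

The registry-lookup shape is one common way to keep the boundary hard: the LLM can only name functions the server already exposes.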
Three-Tier Memory System
The most distinctive piece of engineering. Inspired by the A-MEM research paper (arXiv:2502.12110, Feb 2025) and the “Memory in the Age of AI Agents” survey (arXiv:2512.13564, Dec 2025):
- Episodes: Raw event records (generation runs, QA verdicts, publishing outcomes). Recorded selectively: errors always, successes only when they carry enrichment data (scores, quality dimensions). This prevents the store from bloating with low-signal noise.
- Semantic Rules: Distilled insights with Bayesian confidence scoring (supporting / (supporting + contradicting + 2)). The prior of 2 means a single observation can never produce high confidence; it takes ~6 supporting episodes with zero contradictions to reach 0.75. Rules are scoped across 32 domains (prompt engineering, compositor, Etsy compliance, seasonal patterns, etc.).
- Templates: Compiled, human-reviewed configurations that bundle proven prompts, generation parameters, listing copy patterns, and expected QA scores. Templates track a running-mean QA score (Knuth's online mean algorithm) and auto-retire if actual scores fall consistently below expectations.
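The two statistics above are small enough to sketch directly; function names here are illustrative, not the actual ShopForge implementation.

```typescript
// Bayesian confidence: supporting / (supporting + contradicting + 2).
// The +2 pseudo-count is the prior: a single supporting observation
// yields only 1/3, and 6 unanimous observations yield exactly 0.75.
function ruleConfidence(supporting: number, contradicting: number): number {
  return supporting / (supporting + contradicting + 2);
}

// Knuth's online mean: update a running mean without storing all scores,
// as the templates tier does for QA scores.
function updateMean(
  mean: number,
  count: number,
  score: number
): { mean: number; count: number } {
  const n = count + 1;
  return { mean: mean + (score - mean) / n, count: n };
}
```

For example, ruleConfidence(6, 0) returns 0.75, matching the ~6-episode threshold stated above.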
A single retrieval function (getRelevantContext) queries all three tiers and returns a readiness assessment: template_ready, rules_available, episodes_only, or no_data, so the agent knows exactly how much institutional knowledge exists for any given task.
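The readiness assessment reduces to a tier-priority check. A hedged sketch, assuming the retrieval result shape (the actual query and record types are not shown in this document):

```typescript
// Illustrative readiness logic for getRelevantContext: the richest
// available memory tier determines how much institutional knowledge
// the agent can lean on for a task.
type Readiness = "template_ready" | "rules_available" | "episodes_only" | "no_data";

// Assumed shape of the cross-tier query result.
interface MemoryHits {
  templates: unknown[];
  rules: unknown[];
  episodes: unknown[];
}

function assessReadiness(hits: MemoryHits): Readiness {
  if (hits.templates.length > 0) return "template_ready";
  if (hits.rules.length > 0) return "rules_available";
  if (hits.episodes.length > 0) return "episodes_only";
  return "no_data";
}
```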
Self-Improvement Loops
Three nested loops, drawing from the Self-Evolving Agents survey and EvoAgentX framework (arXiv:2508.07407) and the Reflexion framework (arXiv:2303.11366, NeurIPS 2023):
- Inner loop (per-task). Reflexion-style retry with QA feedback. If >40% of generated images fail QA, the planner adjusts prompts and regenerates within the same run.
- Middle loop (adaptive cadence). Episodes are clustered by (category, style, theme) and analyzed by a consolidation sub-agent that proposes new semantic rules. The cadence adapts to maturity: every product early on, every 5 products once the shop is established. Evidence counts are clamped to prevent the LLM from hallucinating inflated confidence.
- Outer loop (strategic). A metacognition skill, structured around the ICML 2025 position paper on intrinsic metacognitive learning (arXiv:2506.05109), generates improvement curves, identifies strengths/weaknesses by style, and proposes strategic pivots. The user makes the final call.
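The inner loop's regeneration trigger can be sketched in a few lines. The 40% threshold comes from the text; the retry cap and function names are assumptions.

```typescript
// Reflexion-style inner-loop trigger: regenerate within the same run
// when too many images fail QA, up to an assumed retry cap.
const FAILURE_THRESHOLD = 0.4; // from the text: >40% failures
const MAX_RETRIES = 2;         // illustrative cap, not from the source

function failureRate(verdicts: boolean[]): number {
  if (verdicts.length === 0) return 0;
  const failed = verdicts.filter((pass) => !pass).length;
  return failed / verdicts.length;
}

function shouldRegenerate(verdicts: boolean[], attempt: number): boolean {
  return attempt < MAX_RETRIES && failureRate(verdicts) > FAILURE_THRESHOLD;
}
```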
A fourth mechanism, skill evolution, can promote high-confidence rules directly into the skill markdown files, permanently improving the agent’s behavior. This is a deliberate action (not automatic): after consolidation, the system scans for graduation candidates and proposes changes. Low-risk edits can auto-apply; structural changes require human approval. Every graduation is logged with a diff and is rollback-capable via backup files.
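The graduation scan described above can be sketched as a filter plus a risk gate; the thresholds and field names here are assumptions consistent with the confidence math earlier, not the real schema.

```typescript
// Hypothetical graduation scan: only high-confidence, well-evidenced
// rules become candidates, and only low-risk edits may auto-apply.
interface SemanticRule {
  id: string;
  confidence: number;  // supporting / (supporting + contradicting + 2)
  supporting: number;
  risk: "low" | "structural";
}

function graduationCandidates(rules: SemanticRule[]): SemanticRule[] {
  // Assumed thresholds: 0.75 confidence, ~6 supporting episodes.
  return rules.filter((r) => r.confidence >= 0.75 && r.supporting >= 6);
}

function canAutoApply(rule: SemanticRule): boolean {
  // Structural changes always require human approval.
  return rule.risk === "low";
}
```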
Intent-Based Skills (Not Procedural Scripts)
20+ skills define purpose, quality rubrics, available tools, decision gates, and output contracts, not step-by-step procedures. Claude derives the implementation. A v1-to-v2 refactoring achieved a 74% line reduction (6,163 → 1,613 lines across 7 core skills) while preserving all quality-critical domain knowledge.
Skills encode hard-won operational lessons. Examples:
- The corner crop signature check (800×800 crop of each corner, every image) catches AI watermarks that full-image review misses ~30% of the time
- The multi-pass defringe pipeline eliminates green-screen compositing artifacts using three simultaneous color detection methods (RGB ratio, HSV, relative green excess) because any single method misses edge cases
- The hero photo diversity gate enforces round-robin rotation across 5 frame/room variations with a minimum-separation constraint, so the shop grid never looks templated
- “Never auto-publish” appears as an unbreakable rule in 8 separate skill files; the redundancy is intentional
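The three detection methods in the defringe bullet can be sketched as independent predicates whose union flags a pixel; the specific thresholds below are illustrative assumptions, not the shipped values.

```typescript
// Three simultaneous green-detection methods; any one firing flags the
// pixel, so each method's blind spots are covered by the others.
type RGB = { r: number; g: number; b: number }; // 0-255 channels

// Method 1: RGB ratio — green dominates both other channels.
const byRatio = (p: RGB) => p.g > p.r * 1.3 && p.g > p.b * 1.3;

// Method 2: hue window — the pixel's hue lands in a green band (~90-150 deg).
function hue(p: RGB): number {
  const r = p.r / 255, g = p.g / 255, b = p.b / 255;
  const max = Math.max(r, g, b), min = Math.min(r, g, b), d = max - min;
  if (d === 0) return 0; // achromatic
  let h: number;
  if (max === r) h = ((g - b) / d) % 6;
  else if (max === g) h = (b - r) / d + 2;
  else h = (r - g) / d + 4;
  return (h * 60 + 360) % 360;
}
const byHue = (p: RGB) => hue(p) >= 90 && hue(p) <= 150;

// Method 3: relative green excess over the stronger of red/blue.
const byExcess = (p: RGB) => p.g - Math.max(p.r, p.b) > 40;

const isGreenFringe = (p: RGB) => byRatio(p) || byHue(p) || byExcess(p);
```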
Graduated Autonomy
Decision gates track their own approval history. The system stores approval rates and precedent counts per gate type, enabling skills to recommend auto-approval for decisions with strong track records while always requiring human approval for safety-critical gates (publishing, compliance). Autonomy is earned through demonstrated competence, not configured by default.
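A minimal sketch of that logic, assuming a per-gate history record; the thresholds and the exact safety-critical gate names are illustrative, though the text fixes publishing and compliance as always human-approved.

```typescript
// Graduated autonomy: recommend auto-approval only for gates with a
// strong precedent record, and never for safety-critical gates.
interface GateHistory {
  approvals: number;
  total: number;
}

const SAFETY_CRITICAL = new Set(["publish", "compliance"]);

function recommendAutoApprove(gate: string, h: GateHistory): boolean {
  if (SAFETY_CRITICAL.has(gate)) return false; // always human-approved
  if (h.total < 10) return false;              // assumed precedent minimum
  return h.approvals / h.total >= 0.95;        // assumed approval-rate bar
}
```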
What’s Implemented
Full production pipeline:
- AI image generation via Gemini API with learned prompt optimization
- Multi-pass green-screen compositing (frame mockups with chroma-key art placement)
- Multi-tier QA: visual inspection + corner crop analysis + persona-based buyer evaluation (3 buyer personas with full psychographic profiles) + Etsy compliance verification
- SEO copywriting with brand voice consistency
- Automated Etsy draft creation (never auto-publishes; human publishes manually)
- Pinterest pin scheduling with configurable content mix targets
Infrastructure:
- ~95 server functions callable via direct import (bypassed MCP protocol for reliability)
- SQLite in WAL mode with 14 tables (crash-safe, zero-config, survives session interruptions)
- Atomic state writes (temp file + rename) throughout
- A/B experiment framework with composite engagement scoring (sales weighted 5× over favorites, 15% signal threshold for winner declaration)
- Live dashboard with Server-Sent Events
- Session state continuity: any session can resume exactly where the last one stopped
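The atomic-write pattern listed above is the classic temp-file-plus-rename idiom; this sketch shows the shape (the helper name is illustrative):

```typescript
import { writeFileSync, renameSync } from "node:fs";

// Write the full payload to a sibling temp file, then rename over the
// target. rename() replaces the file atomically on POSIX filesystems,
// so a reader never observes a half-written state file.
function atomicWriteJSON(path: string, data: unknown): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  renameSync(tmp, path);
}
```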
Tech stack: TypeScript/Node.js, SQLite (better-sqlite3), Google Gemini API, sharp (image processing), RealESRGAN (upscaling, with sharp fallback), Claude (intelligence layer)
Early Results (10 Days of Production)
ShopForge has been operating the ChateaucoreWalls Etsy shop since late February 2026. This section is deliberately transparent about what “early” means:
What’s real and working:
- 17 published listings across 5 styles (watercolor, classical oil painting, botanical illustration, oil painting, fine art photography)
- 68 episodes recorded across 8 types, feeding the memory system
- 15 semantic rules with Bayesian confidence scores ranging from 0.00 (axioms awaiting evidence) to 0.82 (Pinterest queue ordering, learned from FK constraint failures)
- Rules derived from real production failures, including a consolidation API bug that produced a rule about field naming conventions, and a size-claim verification rule after listing copy overstated print dimensions
- Early traffic data: top performer at 25 views and 4 favorites in 2 days (Chateaucore Watercolor 20-Pack)
- Average rule confidence climbing: image generation scope rose from 0.33–0.60 to 0.74 average; operational rules reached 0.99
What’s still in cold start:
- Zero sales and zero revenue (shop is ~10 days old)
- Persona calibration has no predicted-vs-actual feedback loop yet (requires sales data)
- No A/B experiments created yet
- Improvement curves lack structured QA scores for trend generation (QA verdicts were recorded as PASS/FAIL, not numeric)
- Gate enforcement logging hasn’t been called through the formal workflow yet
The architecture predicts a cold start period. Bayesian priors are designed to require ~6 supporting episodes with zero contradictions before a rule reaches 0.75 confidence. The system is honest about what it doesn’t know yet.
What This Demonstrates
- Agentic architecture design: A production system where an LLM operates as a reasoning engine within a structured execution framework, not as a chatbot
- Self-improving systems without fine-tuning: Three-tier memory with Bayesian confidence, automated consolidation, and skill evolution that permanently improves agent behavior based on production evidence
- Research-to-production translation: Concepts from 9 cited research papers and frameworks (A-MEM, Memory Survey, Reflexion, EvoAgentX, ICML 2025 metacognition, SICA, PersonaCite, Nakajima, Microsoft ISE) implemented as working software
- Trust engineering: Graduated autonomy gates, evidence clamping to prevent hallucinated confidence, human gates on templates, 5-layer publish safety, mandatory AI disclosure verification
- Domain knowledge encoding: Converting operational lessons (corner crop checks, defringe pipelines, hero diversity) into system constraints that prevent regression
- Full-stack AI product development: Image generation through marketplace listing, with quality assurance, compliance, marketing, and analytics in a single coherent system