Status: Active · Fully operational production pipeline · Last verified: May 11, 2026
Evolved from ShopSmith OS, which proved the domain. ShopForge rebuilt it as a self-improving agent.
Read the full case study for the narrative behind this project, including session logs, research attribution, and early production results.
The Problem
Running a digital art print shop on Etsy means repeating the same high-friction pipeline hundreds of times: generate art, quality-check every image, write SEO copy, composite listing photos, verify Etsy compliance, publish a draft, and schedule Pinterest pins. Each step has subtle failure modes: AI-generated signatures hiding in corners, green fringe artifacts on composited frames, listings that accidentally say “handmade” instead of “AI-generated.” Doing this manually doesn’t scale. Scripting it procedurally doesn’t learn.
I wanted a system that could run the full pipeline autonomously, remember what works, and get measurably better at it over time, without fine-tuning a model.
The Approach
ShopForge is an agentic AI system built on Claude that operates a real Etsy shop called ChateaucoreWalls (French château aesthetic digital art prints). It’s not a chatbot wrapper or a prompt chain. It’s a self-improving agent with persistent memory, multi-tier quality assurance, and skill files that evolve based on production evidence.
The architecture enforces a hard separation between an intelligence layer (Claude + parallel sub-agents for reasoning, QA, and copywriting) and a data/execution layer (Node.js + SQLite for deterministic operations, image processing, and state management). The LLM reasons; the server executes. Neither crosses the boundary.
Three-Tier Memory System
This is the most distinctive piece of engineering, inspired by the A-MEM research paper (arXiv:2502.12110, Feb 2025) and the “Memory in the Age of AI Agents” survey (arXiv:2512.13564, Dec 2025):
- Episodes: Raw event records (generation runs, QA verdicts, publishing outcomes). Recorded selectively: errors always, successes only when they carry enrichment data (scores, quality dimensions). This prevents the store from bloating with low-signal noise.
- Semantic Rules: Distilled insights with Bayesian confidence scoring (supporting / (supporting + contradicting + 2)). The prior of 2 means a single observation can never produce high confidence; it takes ~6 supporting episodes with zero contradictions to reach 0.75. Rules are scoped across 32 domains (prompt engineering, compositor, Etsy compliance, seasonal patterns, etc.).
- Templates: Compiled, human-reviewed configurations that bundle proven prompts, generation parameters, listing copy patterns, and expected QA scores. Templates track a running-mean QA score (Knuth’s online mean algorithm) and auto-retire if actual scores fall consistently below expectations.
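The two formulas in this list are small enough to sketch directly. This is a minimal illustration, not the production code: the function and type names (ruleConfidence, RunningMean, updateMean) and the sample QA scores are assumptions; only the confidence formula and the use of an online mean come from the description above.

```typescript
// Sketch only: names and sample values are hypothetical.

/** Confidence as stated in the text: supporting / (supporting + contradicting + 2). */
function ruleConfidence(supporting: number, contradicting: number): number {
  if (supporting < 0 || contradicting < 0) {
    throw new RangeError("evidence counts must be non-negative");
  }
  return supporting / (supporting + contradicting + 2);
}

/** State for an online mean: no past scores need to be stored. */
interface RunningMean {
  count: number;
  mean: number;
}

/** Incremental update: mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n. */
function updateMean(state: RunningMean, score: number): RunningMean {
  const count = state.count + 1;
  return { count, mean: state.mean + (score - state.mean) / count };
}

// Confidence after 1, 3, and 6 supporting episodes with no contradictions.
const curve = [1, 3, 6].map((n) => ruleConfidence(n, 0));
console.log(curve.map((c) => c.toFixed(2)).join(", ")); // 0.33, 0.60, 0.75

// Running mean over three made-up QA scores.
let qa: RunningMean = { count: 0, mean: 0 };
for (const score of [8, 9, 7]) {
  qa = updateMean(qa, score);
}
console.log(qa.mean); // 8
```

The +2 in the denominator acts like two pseudo-observations, which is why one lucky success only yields 0.33.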
A single retrieval function (getRelevantContext) queries all three tiers and returns a readiness assessment: template_ready, rules_available, episodes_only, or no_data, so the agent knows exactly how much institutional knowledge exists for any given task.
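The readiness assessment can be pictured as a fall-through over the three tiers. The sketch below is an assumption about how such a classifier might look: the TierCounts shape, the assessReadiness name, and the richest-tier-wins ordering are hypothetical; only the four readiness labels come from the text.

```typescript
// Sketch only: the input shape and precedence order are assumptions.
type Readiness = "template_ready" | "rules_available" | "episodes_only" | "no_data";

interface TierCounts {
  templates: number; // templates matching the task
  rules: number;     // semantic rules matching the task
  episodes: number;  // raw episodes matching the task
}

/** Report the richest memory tier that has any data for the task. */
function assessReadiness(counts: TierCounts): Readiness {
  if (counts.templates > 0) return "template_ready";
  if (counts.rules > 0) return "rules_available";
  if (counts.episodes > 0) return "episodes_only";
  return "no_data";
}

console.log(assessReadiness({ templates: 0, rules: 3, episodes: 12 })); // rules_available
```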
Self-Improvement Loops
Three nested loops, drawing from the Self-Evolving Agents survey and EvoAgentX framework (arXiv:2508.07407) and the Reflexion framework (arXiv:2303.11366, NeurIPS 2023):
- Inner loop (per-task). Reflexion-style retry with QA feedback. If >40% of generated images fail QA, the planner adjusts prompts and regenerates within the same run.
- Middle loop (adaptive cadence). Episodes are clustered by (category, style, theme) and analyzed by a consolidation sub-agent that proposes new semantic rules. The cadence adapts to maturity: every product early on, every 5 products once the shop is established. Evidence counts are clamped to prevent the LLM from hallucinating inflated confidence.
- Outer loop (strategic). A metacognition skill, structured around the ICML 2025 position paper on intrinsic metacognitive learning (arXiv:2506.05109), generates improvement curves, identifies strengths/weaknesses by style, and proposes strategic pivots. The user makes the final call.
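The decision points in the inner and middle loops reduce to a few small checks. This is a sketch under stated assumptions: the function names are hypothetical, and clamping a claimed evidence count to the number of episodes actually in the cluster is one plausible reading of "clamped", not a confirmed detail. The 40% threshold and the 1-versus-5 product cadence come from the text.

```typescript
// Sketch only: names and the exact clamping rule are assumptions.

/** Inner loop: regenerate when more than 40% of generated images fail QA. */
function shouldRegenerate(failed: number, total: number, threshold = 0.4): boolean {
  if (total <= 0) return false; // nothing generated, nothing to retry
  return failed / total > threshold;
}

/** Middle loop cadence: every product early on, every 5 products once established. */
function shouldConsolidate(productsSinceLast: number, established: boolean): boolean {
  return productsSinceLast >= (established ? 5 : 1);
}

/**
 * Middle loop safety: an evidence count proposed by the LLM is limited to
 * the range [0, episodesInCluster], so confidence cannot be inflated by a
 * hallucinated number.
 */
function clampEvidence(claimed: number, episodesInCluster: number): number {
  if (!isFinite(claimed)) return 0; // reject NaN and Infinity outright
  return Math.max(0, Math.min(Math.floor(claimed), episodesInCluster));
}

console.log(shouldRegenerate(5, 12)); // true  (about 41.7% failed)
console.log(shouldRegenerate(4, 10)); // false (exactly 40% is not "more than")
console.log(clampEvidence(40, 7));    // 7
```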
A fourth mechanism, skill evolution, can promote high-confidence rules directly into the skill markdown files, permanently improving the agent’s behavior. Graduation runs as an explicit step rather than continuously: after consolidation, the system scans for graduation candidates and proposes changes. Low-risk edits can auto-apply; structural changes require human approval. Every graduation is logged with a diff and can be rolled back from backup files.
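A rough sketch of the graduation scan, assuming a rule carries a confidence score and a flag for whether the proposed edit is structural. The Rule and Proposal shapes, the proposeGraduations name, and the confidence cutoff passed in are all hypothetical; the text does not state the actual graduation threshold.

```typescript
// Sketch only: shapes, names, and the cutoff value are assumptions.
interface Rule {
  id: string;
  confidence: number;  // Bayesian confidence in [0, 1)
  structural: boolean; // true if graduating it would restructure a skill file
}

interface Proposal {
  ruleId: string;
  action: "auto_apply" | "needs_human_approval";
}

/** Select rules at or above the cutoff and route each by risk. */
function proposeGraduations(rules: Rule[], minConfidence: number): Proposal[] {
  return rules
    .filter((r) => r.confidence >= minConfidence)
    .map((r): Proposal => ({
      ruleId: r.id,
      action: r.structural ? "needs_human_approval" : "auto_apply",
    }));
}

const proposals = proposeGraduations(
  [
    { id: "queue-ordering", confidence: 0.82, structural: false },
    { id: "qa-rubric-rewrite", confidence: 0.8, structural: true },
    { id: "unproven", confidence: 0.33, structural: false },
  ],
  0.75, // hypothetical cutoff
);
console.log(proposals.map((p) => `${p.ruleId}:${p.action}`).join(" "));
```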
Intent-Based Skills (Not Procedural Scripts)
20+ skills define purpose, quality rubrics, available tools, decision gates, and output contracts, not step-by-step procedures. Claude derives the implementation. A v1-to-v2 refactoring achieved a 74% line reduction (6,163 → 1,613 lines across 7 core skills) while preserving all quality-critical domain knowledge.
Skills encode hard-won operational lessons. Examples:
- The corner crop signature check (800×800 crop of each corner, every image) catches AI watermarks that full-image review misses ~30% of the time
- The multi-pass defringe pipeline eliminates green-screen compositing artifacts using three simultaneous color detection methods (RGB ratio, HSV, relative green excess) because any single method misses edge cases
- The hero photo diversity gate enforces round-robin rotation across 5 frame/room variations with a minimum-separation constraint, so the shop grid never looks templated
- “Never auto-publish” appears as an unbreakable rule in 8 separate skill files; the redundancy is intentional
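Two of the checks above are easy to illustrate in isolation: computing the four corner crop regions, and flagging a pixel as green fringe when any of three detectors fires. This is a sketch with assumed details. Every numeric threshold in isGreenFringe is a placeholder, the exact definition of "relative green excess" is a guess, and the function names are hypothetical. The Rect shape uses the same left/top/width/height fields that sharp's extract() accepts.

```typescript
// Sketch only: thresholds and names are assumptions, not tuned values.
interface Rect { left: number; top: number; width: number; height: number; }

/** Four corner regions of up to `size` px square, clipped to the image bounds. */
function cornerCrops(width: number, height: number, size = 800): Rect[] {
  const w = Math.min(size, width);
  const h = Math.min(size, height);
  return [
    { left: 0, top: 0, width: w, height: h },                  // top-left
    { left: width - w, top: 0, width: w, height: h },          // top-right
    { left: 0, top: height - h, width: w, height: h },         // bottom-left
    { left: width - w, top: height - h, width: w, height: h }, // bottom-right
  ];
}

/** Hue in degrees [0, 360) and saturation in [0, 1] from 8-bit RGB. */
function hueSat(r: number, g: number, b: number): { h: number; s: number } {
  const max = Math.max(r, g, b);
  const d = max - Math.min(r, g, b);
  if (d === 0) return { h: 0, s: 0 }; // gray: hue undefined, report 0
  let h: number;
  if (max === r) h = ((g - b) / d) % 6;
  else if (max === g) h = (b - r) / d + 2;
  else h = (r - g) / d + 4;
  h *= 60;
  if (h < 0) h += 360;
  return { h, s: d / max };
}

/** Flag the pixel if ANY detector fires, since each one alone misses cases. */
function isGreenFringe(r: number, g: number, b: number): boolean {
  const byRatio = g > 1.3 * Math.max(r, b, 1);            // RGB ratio
  const { h, s } = hueSat(r, g, b);
  const byHsv = h >= 90 && h <= 150 && s >= 0.4;          // HSV window around green
  const byExcess = (g - (r + b) / 2) / Math.max(g, 1) > 0.25; // relative green excess
  return byRatio || byHsv || byExcess;
}

console.log(cornerCrops(4000, 3000)[1].left); // 3200
console.log(isGreenFringe(0, 255, 0));        // true
console.log(isGreenFringe(128, 128, 128));    // false
```

Taking the union of the detectors trades a few false positives for fewer missed fringe pixels, which matches the stated motivation for running all three.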
Graduated Autonomy
Decision gates track their own approval history. The system stores approval rates and precedent counts per gate type, enabling skills to recommend auto-approval for decisions with strong track records while always requiring human approval for safety-critical gates (publishing, compliance). Autonomy is earned through demonstrated competence, not configured by default.
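A minimal sketch of how a gate might turn its approval history into a recommendation. The GateHistory shape, the recommendGate name, and the default thresholds (20 precedents, 95% approval rate) are hypothetical; the text only states that approval rates and precedent counts are stored per gate type, and that publishing and compliance gates always require a human.

```typescript
// Sketch only: shape, names, and default thresholds are assumptions.
interface GateHistory {
  gateType: string;
  approvals: number;
  rejections: number;
}

// Gates the text names as always requiring human approval.
const SAFETY_CRITICAL_GATES: readonly string[] = ["publishing", "compliance"];

function recommendGate(
  history: GateHistory,
  minPrecedents = 20,
  minApprovalRate = 0.95,
): "auto_approve" | "ask_human" {
  // Safety-critical gates never graduate, whatever their track record.
  if (SAFETY_CRITICAL_GATES.indexOf(history.gateType) !== -1) return "ask_human";
  const total = history.approvals + history.rejections;
  if (total < minPrecedents) return "ask_human"; // not enough precedent yet
  return history.approvals / total >= minApprovalRate ? "auto_approve" : "ask_human";
}

console.log(recommendGate({ gateType: "hero_photo", approvals: 29, rejections: 1 })); // auto_approve
console.log(recommendGate({ gateType: "publishing", approvals: 500, rejections: 0 })); // ask_human
```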
What’s Implemented
Full production pipeline:
- AI image generation via Gemini API with learned prompt optimization
- Multi-pass green-screen compositing (frame mockups with chroma-key art placement)
- Multi-tier QA: visual inspection + corner crop analysis + persona-based buyer evaluation (3 buyer personas with full psychographic profiles) + Etsy compliance verification
- SEO copywriting with brand voice consistency
- Automated Etsy draft creation (never auto-publishes; human publishes manually)
- Pinterest pin scheduling with configurable content mix targets
Infrastructure:
- ~95 server functions callable via direct import (bypassed MCP protocol for reliability)
- SQLite in WAL mode with 14 tables (crash-safe, zero-config, survives session interruptions)
- Atomic state writes (temp file + rename) throughout
- A/B experiment framework with composite engagement scoring (sales weighted 5× over favorites, 15% signal threshold for winner declaration)
- Live dashboard with Server-Sent Events
- Session state continuity: any session can resume exactly where the last one stopped
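The A/B scoring rule in the list above can be sketched as follows. The 5× sales weight and the 15% threshold come from the text. Treating the threshold as the leader's relative margin over the runner-up is an assumption, as are the VariantStats shape and the function names.

```typescript
// Sketch only: the threshold interpretation and all names are assumptions.
interface VariantStats {
  name: string;
  favorites: number;
  sales: number;
}

/** Composite engagement: one sale counts as much as five favorites. */
function engagementScore(v: VariantStats): number {
  return v.favorites + 5 * v.sales;
}

/** Name a winner only if it leads the other variant by at least `threshold` (relative). */
function declareWinner(a: VariantStats, b: VariantStats, threshold = 0.15): string | null {
  const sa = engagementScore(a);
  const sb = engagementScore(b);
  const leader = sa >= sb ? a : b;
  const high = Math.max(sa, sb);
  const low = Math.min(sa, sb);
  if (high === 0) return null; // no signal at all
  const lead = low === 0 ? Infinity : (high - low) / low;
  return lead >= threshold ? leader.name : null;
}

const variantA: VariantStats = { name: "A", favorites: 10, sales: 2 }; // score 20
const variantB: VariantStats = { name: "B", favorites: 12, sales: 1 }; // score 17
console.log(declareWinner(variantA, variantB)); // A (lead of 3/17, about 17.6%)
```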
Tech stack: TypeScript/Node.js, SQLite (better-sqlite3), Google Gemini API, sharp (image processing), RealESRGAN (upscaling, with sharp fallback), Claude (intelligence layer)
Initial 10-Day Results (Feb–Mar 2026)
Snapshot from the first 10 days of operation on the ChateaucoreWalls Etsy shop. ShopForge has been running since late February 2026; the numbers below are the cold-start window, kept here for honesty about how the architecture’s Bayesian priors behave with zero history.
What’s real and working:
- 17 published listings across 5 styles (watercolor, classical oil painting, botanical illustration, oil painting, fine art photography)
- 68 episodes recorded across 8 types, feeding the memory system
- 15 semantic rules with Bayesian confidence scores ranging from 0.00 (axioms awaiting evidence) to 0.82 (Pinterest queue ordering, learned from FK constraint failures)
- Rules derived from real production failures, including a consolidation API bug that produced a rule about field naming conventions, and a size-claim verification rule after listing copy overstated print dimensions
- Early traffic data: top performer at 25 views and 4 favorites in 2 days (Chateaucore Watercolor 20-Pack)
- Average rule confidence climbing: image generation scope rose from 0.33–0.60 to 0.74 average; operational rules reached 0.99
What’s still in cold start:
- Zero sales and zero revenue (the shop was ~10 days old at the time of this snapshot)
- Persona calibration has no predicted-vs-actual feedback loop yet (requires sales data)
- No A/B experiments created yet
- Improvement curves lack structured QA scores for trend generation (QA verdicts were recorded as PASS/FAIL, not numeric)
- Gate enforcement logging hasn’t been called through the formal workflow yet
The architecture predicts a cold start period. Bayesian priors are designed to require ~6 supporting episodes with zero contradictions before a rule reaches 0.75 confidence. The system is honest about what it doesn’t know yet.
What This Demonstrates
- Agentic architecture design: A production system where an LLM operates as a reasoning engine within a structured execution framework, not as a chatbot
- Self-improving systems without fine-tuning: Three-tier memory with Bayesian confidence, automated consolidation, and skill evolution that permanently improves agent behavior based on production evidence
- Research-to-production translation: Concepts from 9 cited research papers and frameworks (A-MEM, Memory Survey, Reflexion, EvoAgentX, ICML 2025 metacognition, SICA, PersonaCite, Nakajima, Microsoft ISE) implemented as working software
- Trust engineering: Graduated autonomy gates, evidence clamping to prevent hallucinated confidence, human gates on templates, 5-layer publish safety, mandatory AI disclosure verification
- Domain knowledge encoding: Converting operational lessons (corner crop checks, defringe pipelines, hero diversity) into system constraints that prevent regression
- Full-stack AI product development: Image generation through marketplace listing, with quality assurance, compliance, marketing, and analytics in a single coherent system