Four of the first six consolidation attempts failed.
The error was `NOT NULL constraint failed: semantic_rules.condition`. The consolidation sub-agent was returning a `rule` field instead of the separate `condition` and `insight` fields that the database schema required. The system recorded each failure as an episode. On the next consolidation run, those failure episodes were clustered, analyzed, and distilled into a semantic rule: “The consolidation API requires a separate ‘condition’ field; do not use ‘rule’ alone.”
The system had taught itself how to call its own API correctly.
That moment, a production failure becoming a persistent, evidence-backed lesson that prevents future failures, is the core idea behind ShopForge. Not a chatbot, not a prompt chain, not a script that runs the same way every time. A system that remembers what went wrong, distills it into knowledge, and uses that knowledge to get better.
ShopForge operates a real Etsy digital art print shop called ChateaucoreWalls. It generates images, runs multi-tier quality assurance, writes SEO listing copy, composites mockup photos, creates Etsy drafts, and schedules Pinterest pins. It has been running for 10 days, published 17 real listings, and is still in its cold start period, which is part of the story.
This case study covers the architecture, the research that informed it, the session moments that shaped it, and an honest assessment of what’s working and what isn’t.
The Problem That Killed v1
ShopForge’s predecessor, ShopSmith OS, was a full-stack React + Express application with a 9-step guided publish workflow. It worked. Across 86 commits and 37 merged PRs in January 2026, it proved that AI-powered content generation, structured quality gates, and local-first data ownership could meaningfully reduce the friction of running a digital print shop.
But ShopSmith didn’t learn. Every listing started from scratch with no memory of which prompts produced better images, no awareness of which copy patterns drove more favorites, no ability to adapt its own behavior based on outcomes. It was a tool, not an agent.
The breaking point came in a session on February 25, 2026. I asked Claude to research the latest in agentic self-improving systems, then brainstorm how we might reimagine ShopSmith from first principles. The diagnosis was blunt:
> The system was built as “server does work, skills tell Claude which server functions to call in what order.” But Claude in Cowork is capable of reasoning, seeing images, editing files, and making judgment calls natively. The server should only exist where Claude genuinely can’t do the work.
The numbers confirmed it: an 8,750-line monolithic server doing three jobs (only one needed to be a server), 11,324 lines of procedural skill markdown that reduced Claude to a script executor instead of an agent that reasons, and an MCP transport protocol that was unreliable inside the Cowork environment. ShopSmith was architecturally incapable of self-improvement.
The decision was immediate. Not a refactor. A rebuild from first principles.
ShopSmith v1 was a craftsman: it hammered out listings one at a time. V2 is something fundamentally different. It’s a system that learns, remembers, coordinates multiple agents, and gets better with every pack.
The new project name will be ShopForge.
Architecture: Two Layers, Hard Boundary
ShopForge was built in a 12-hour sprint across 7 phases (February 25–26, 2026), with the largest single commit at 5,054 insertions across 42 files. The architecture enforces a clean separation between reasoning and execution. That separation is not an aesthetic choice but a reliability decision: LLMs hallucinate; databases don’t. By never letting the intelligence layer touch persistence directly, and never asking the data layer to reason, each failure mode stays contained.
```
┌──────────────────────────────────────────────────────┐
│                 INTELLIGENCE LAYER                   │
│               (Claude + Sub-Agents)                  │
│                                                      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│  │ Planner  │   │ Creator  │   │ Verifier │          │
│  │ (1 skill)│   │(6 skills)│   │(4 skills)│          │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘          │
│       │              │              │                │
│  ┌────┴─────┐   ┌────┴─────┐   ┌────┴──────┐         │
│  │ Meta-    │   │ Session/ │   │ Up to 10  │         │
│  │ cognition│   │ Publish  │   │ parallel  │         │
│  │ + Evolve │   │          │   │ sub-agents│         │
│  └──────────┘   └──────────┘   └───────────┘         │
│                                                      │
│  20 intent-based skills (purpose + rubrics +         │
│  gates + output contracts, NOT procedures)           │
├──────────────────────────────────────────────────────┤
│  Direct bash import (bypassed MCP for reliability)   │
├──────────────────────────────────────────────────────┤
│               DATA / EXECUTION LAYER                 │
│                 (Node.js + SQLite)                   │
│                                                      │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐         │
│  │ Memory   │   │Execution │   │Observation│         │
│  │ Engine   │   │ Engine   │   │ Engine    │         │
│  │          │   │          │   │           │         │
│  │episodes  │   │gemini api│   │episode    │         │
│  │rules     │   │composit- │   │recording  │         │
│  │templates │   │ing/sharp │   │confidence │         │
│  │retrieval │   │packaging │   │consolid-  │         │
│  │          │   │pinterest │   │ation      │         │
│  └──────────┘   └──────────┘   └───────────┘         │
│                                                      │
│  SQLite (WAL mode) · 13 tables + archive             │
│  Atomic writes · Session state persistence           │
│  46 TypeScript source files · ~95 functions          │
└──────────────────────────────────────────────────────┘
```
Intelligence Layer
Twenty skill files organized into five groups: Planner, Creator (6 skills), Verifier (4 skills), Metacognition/Evolve, and Session/Publishing. Each skill defines purpose, quality rubrics, available tools, decision gates, and output contracts, not step-by-step procedures. Claude reads a skill once and derives the entire workflow from intent.
This was a deliberate architectural bet. A v1-to-v2 refactoring of seven core skills achieved a 74% line reduction (6,163 → 1,613 lines) while preserving all quality-critical domain knowledge. The system became more capable with less instruction because the instructions became higher-level.
Up to 10 parallel sub-agents (200K context each) handle discrete reasoning tasks: consolidation analysis, persona-based buyer evaluation, SEO copywriting, and pattern detection. A core design principle: no agent grades its own homework. The agent that generates images is never the agent that evaluates them.
Data/Execution Layer
Forty-six TypeScript source files organized into three engines: Memory (episodes, semantic rules, templates, retrieval), Execution (Gemini API image generation, multi-pass compositing, upscaling, packaging, Pinterest automation), and Observation (episode recording, Bayesian confidence scoring, consolidation analysis, persona evaluation, evolution scanning).
The server is invoked via direct bash import, not MCP protocol. MCP transport was unreliable inside the Cowork VM environment, and the extra abstraction layer added latency without benefit. Direct import is faster, more debuggable, and never drops connections mid-pipeline.
Storage is SQLite in WAL mode with 13 tables, plus an episodes_archive table for compacted old data. Every write is atomic (temp file + rename). Session state is persisted to data/session-state.json after every significant action, so any session can resume exactly where the last one stopped, even after a context window exhaustion or VM crash.
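The temp-file + rename pattern behind those atomic writes can be sketched like this (function and file names are illustrative, not ShopForge’s actual helpers):

```typescript
import { writeFileSync, renameSync } from "node:fs";
import { join, dirname, basename } from "node:path";

// Temp file lives in the SAME directory as the target, so rename() never
// crosses a filesystem boundary and stays atomic on POSIX systems.
function tempPathFor(path: string): string {
  return join(dirname(path), `.${basename(path)}.tmp`);
}

// Write the full payload to the temp file, then atomically swap it in.
// A reader never observes a half-written session-state file.
function atomicWriteJSON(path: string, data: unknown): void {
  const tmp = tempPathFor(path);
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  renameSync(tmp, path);
}
```

The key detail is that `rename` replaces the destination in a single step; a crash mid-write leaves only a stale temp file, never a corrupted state file.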
The Three-Tier Memory System
The memory architecture is the core of ShopForge’s self-improvement capability. It draws from A-MEM’s principle of autonomous memory organization (arXiv:2502.12110) and the “Memory in the Age of AI Agents” survey’s pattern of progressive abstraction (arXiv:2512.13564), but the specific design decisions were shaped by the constraints of operating inside a session-based LLM environment.
Research foundation: A-MEM (arXiv:2502.12110, Feb 2025) demonstrated that LLM agents benefit from memory systems that autonomously organize and interconnect knowledge, following Zettelkasten principles where memories generate their own contextual descriptions and form meaningful connections. The “Memory in the Age of AI Agents” survey (arXiv:2512.13564, Dec 2025) established a taxonomy of memory forms (token-level, parametric, latent) and functions (factual, experiential, working), with progressive abstraction as a core pattern. ShopForge synthesizes these ideas into a concrete three-tier hierarchy optimized for session-based operation: raw episodes, distilled semantic rules, and compiled templates.
Tier 1: Episodes (Raw Operational Data)
Every action records a structured episode:
```typescript
{
  id: string,               // UUID, idempotent writes
  type: "generation" | "qa" | "decision" | "publish" | "metric" |
        "consolidation" | "research" | "storefront",
  category: string,         // Scoped domain (e.g., "prompt_engineering")
  context: JSON,            // Input state
  outcome: JSON,            // Result with scores/verdicts
  tags: string[],           // Searchable labels
  product_id: string,       // Links to product
  confidence: number,       // 0-1 score
  consolidated_at: Date | null
}
```
A critical design decision: the system does not record success episodes by default. Only errors and enriched outcomes (those carrying scores, quality dimensions, or verdicts) are persisted. This prevents the episode store from bloating with low-signal noise, a problem that would degrade retrieval quality as the store scales. After 10 days, the system has 68 episodes across 8 types: a dense, high-signal dataset, not a noisy firehose.
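That selective-recording rule can be sketched as a small predicate. The optional fields on the outcome shape are assumptions; only the "errors and enriched outcomes" policy comes from the text:

```typescript
interface Outcome {
  error?: string;                    // failure description, if any
  verdict?: string;                  // QA verdict, if any
  scores?: Record<string, number>;   // quality-dimension scores, if any
}

// Persist only failures and enriched outcomes; drop bare successes so the
// episode store stays dense and retrieval quality holds up at scale.
function shouldRecordEpisode(outcome: Outcome): boolean {
  if (outcome.error) return true;     // failures always persist
  if (outcome.verdict) return true;   // verdicts are enriched signal
  if (outcome.scores && Object.keys(outcome.scores).length > 0) return true;
  return false;                       // plain success: skip
}
```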
Tier 2: Semantic Rules (Distilled Knowledge)
Rules are scoped across domains (prompt engineering, compositor, Etsy compliance, seasonal patterns, etc.) and carry evidence-linked confidence:
```typescript
confidence = supporting / (supporting + contradicting + PRIOR)
// PRIOR = 2 (Bayesian smoothing constant)
```
The prior of 2 is the most important number in the system. It means a rule with 1 supporting episode and 0 contradictions gets a confidence of 0.33, not 1.0. A single observation is never enough for high confidence. It takes 6 supporting episodes with zero contradictions to reach exactly 0.75, the threshold for graduation eligibility.
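The formula above is a one-liner in code; the function name is illustrative:

```typescript
const PRIOR = 2; // Bayesian smoothing constant from the formula above

function ruleConfidence(supporting: number, contradicting: number): number {
  return supporting / (supporting + contradicting + PRIOR);
}

// ruleConfidence(1, 0) → 0.33…  one observation is never high confidence
// ruleConfidence(6, 0) → 0.75   the graduation-eligibility threshold
// ruleConfidence(6, 2) → 0.60   genuine ambiguity
```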
Three types of rules coexist:
- Learned rules: Distilled from episode clusters during consolidation. Start at whatever confidence the evidence supports.
- Axioms: Human-asserted rules seeded from domain knowledge. Start at 0.5 with zero evidence, and can be promoted or demoted by real production data.
- Contradiction-aware rules: Rules where both supporting and contradicting evidence exists. The Bayesian formula handles this naturally: a rule with 6 supporting and 2 contradicting episodes gets 0.60 confidence, reflecting genuine ambiguity.
After 10 days, ShopForge has 15 active semantic rules (27 total including inactive and axioms awaiting evidence). Here are four real rules from the production database:
| Rule | Scope | Confidence | Evidence |
|---|---|---|---|
| `product_id` must exist before queueing Pinterest pin | | 0.82 | 9 supporting, 0 contradicting. Learned from FK constraint failures |
| Include “farmhouse” in titles and tags | seo | 0.80 | 8 supporting, 0 contradicting. From a real SEO audit |
| Use lifestyle room mockup hero photos over flat grids | curation | 0.60 | 6 supporting, 2 contradicting. Competitor teardown, but flat grids work for some styles |
| Consolidation API requires separate condition field | operational | 0.60 | Learned from the system’s own failed API calls |
Tier 3: Templates (Compiled Procedures)
Templates are the highest tier: compiled configurations that bundle proven prompts, generation parameters, listing copy patterns, and expected QA scores. They draw from Nakajima’s trajectory-to-exemplar pattern (2025), where successful execution traces are compiled into reusable starting points for future tasks.
Templates start at low confidence and require human review before gaining trust. They track usage count, last-used date, and a running-mean QA score (Knuth’s online mean algorithm) that auto-retires templates whose actual scores fall consistently below expectations.
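The running-mean tracking and auto-retirement can be sketched as follows. Knuth’s online mean update is standard; the retirement thresholds (`minUses`, `tolerance`) are assumptions, since the text only says templates retire when actual scores fall consistently below expectations:

```typescript
interface TemplateStats {
  uses: number;
  meanQa: number;      // running mean of actual QA scores
  expectedQa: number;  // score the template promised when compiled
}

// Knuth's online mean: mean += (x - mean) / n.
// Numerically stable, and no running sum needs to be stored.
function recordTemplateUse(t: TemplateStats, qaScore: number): TemplateStats {
  const uses = t.uses + 1;
  const meanQa = t.meanQa + (qaScore - t.meanQa) / uses;
  return { ...t, uses, meanQa };
}

// Auto-retire once enough real uses consistently undershoot expectations.
function shouldRetire(t: TemplateStats, minUses = 5, tolerance = 0.5): boolean {
  return t.uses >= minUses && t.meanQa < t.expectedQa - tolerance;
}
```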
The system currently has 2 templates, both early-stage, neither human-reviewed. This is honest: the template tier is the last to activate, because it requires enough semantic rules and production data to compile meaningful procedures.
Retrieval and Readiness
A single function, getRelevantContext, queries all three tiers and returns a readiness assessment: template_ready, rules_available, episodes_only, or no_data. This tells the agent exactly how much institutional knowledge exists for any given task, and it adjusts its behavior accordingly. A task with template_ready context runs faster and with lighter QA. A task with no_data triggers thorough evaluation and exploratory episode recording.
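The readiness verdict reduces to a simple priority check. The four verdict strings come from the source; the input shape here is an assumption:

```typescript
type Readiness = "template_ready" | "rules_available" | "episodes_only" | "no_data";

// Highest tier of institutional knowledge available wins.
function assessReadiness(ctx: {
  templates: number;
  rules: number;
  episodes: number;
}): Readiness {
  if (ctx.templates > 0) return "template_ready";
  if (ctx.rules > 0) return "rules_available";
  if (ctx.episodes > 0) return "episodes_only";
  return "no_data";
}
```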
Self-Improvement: Three Nested Loops
The three-loop structure draws from three distinct research traditions.
Research foundation: The inner loop implements Reflexion (arXiv:2303.11366, Shinn et al., NeurIPS 2023), where an agent critiques its own output and retries within the same task using verbal reinforcement rather than weight updates. The middle loop follows the Self-Evolving Agents survey and EvoAgentX framework (arXiv:2508.07407), which formalize how accumulated experiences should be periodically consolidated into reusable knowledge. The outer loop is structured around the ICML 2025 position paper “Truly Self-Improving Agents Require Intrinsic Metacognitive Learning” (arXiv:2506.05109, Liu & van der Schaar), which argues that genuine self-improvement requires three components: metacognitive knowledge, metacognitive planning, and metacognitive evaluation, not just experience replay.
Inner Loop: Per-Task Reflexion
Generate 25 images. Score each on 7 weighted dimensions (artifact-free at 2.0× weight, technical quality and style match at 1.5×, composition at 1.0×). If more than 40% fail QA, the planner adjusts prompts and regenerates within the same run. The adaptation itself becomes an episode: “adjusted prompt by adding X, improved average score from 3.2 to 4.1.”
This loop is automatic and requires no user involvement. It operates within a single session and addresses immediate quality issues.
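The retry decision in this loop is a threshold check. The 40% failure trigger is from the text; the per-image pass threshold of 3.5 is an assumption for illustration:

```typescript
// Fraction of a batch that failed QA, given each image's weighted score.
function failRate(scores: number[], passThreshold = 3.5): number {
  const failed = scores.filter((s) => s < passThreshold).length;
  return failed / scores.length;
}

// If more than 40% fail, the planner adjusts prompts and retries in-run.
function shouldRegenerate(scores: number[]): boolean {
  return failRate(scores) > 0.4;
}
```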
Middle Loop: Consolidation
Episodes accumulate. The system tracks store maturity and triggers consolidation at adaptive intervals:
| Store Maturity | Trigger | Min. Episodes |
|---|---|---|
| Early (<10 published) | After every product | 3 |
| Growth (10–30) | Every 3 products | 5 |
| Mature (30+) | Every 5 products | 5 |
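The maturity table above translates directly into a policy lookup (thresholds from the table; the function name is illustrative):

```typescript
interface ConsolidationPolicy {
  everyNProducts: number; // how often consolidation triggers
  minEpisodes: number;    // smallest cluster worth analyzing
}

function consolidationPolicy(publishedCount: number): ConsolidationPolicy {
  if (publishedCount < 10) return { everyNProducts: 1, minEpisodes: 3 }; // early
  if (publishedCount < 30) return { everyNProducts: 3, minEpisodes: 5 }; // growth
  return { everyNProducts: 5, minEpisodes: 5 };                          // mature
}
```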
During consolidation, unconsolidated episodes are clustered by (category, style, theme). Clusters below the minimum episode count are skipped; the system won’t draw conclusions from insufficient data. A sub-agent (Haiku for routine patterns, Sonnet for complex ones) analyzes each cluster and proposes new or updated semantic rules with explicit evidence links.
A critical safety mechanism: evidence counts are clamped. The consolidation sub-agent might claim a rule has 10 supporting episodes, but if only 4 episode IDs are actually linked, the count is clamped to 4. This prevents the LLM from hallucinating inflated confidence, a failure mode I observed during early testing where the consolidation agent would sometimes overstate its evidence.
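The clamp itself is a one-line guard; deduplication is added here as a sketch, since a sub-agent could also pad evidence by repeating an ID:

```typescript
// Never trust the sub-agent's claimed count beyond the episode IDs
// it actually linked. Deduplicate in case IDs are repeated as padding.
function clampEvidence(claimedCount: number, linkedEpisodeIds: string[]): number {
  const actuallyLinked = new Set(linkedEpisodeIds).size;
  return Math.min(claimedCount, actuallyLinked);
}
```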
Results are written in a single SQLite transaction. If any write fails, all writes roll back. After a successful consolidation, the system triggers an evolution scan for graduation candidates.
And as described in the opening, the consolidation loop is subject to its own self-improvement. The four failed consolidation attempts that taught the system its own API contract are the clearest proof that the architecture works: a real infrastructure failure became a persistent semantic rule that prevents the same failure from recurring.
Outer Loop: Metacognition and Skill Evolution
The metacognition skill generates system health reports: improvement curves by style, confidence maps across scopes, autonomy status per gate type, stale rule detection, and persona calibration accuracy. It synthesizes these into a narrative assessment with proposed strategic actions: experiment suggestions, scope adjustments, autonomy threshold changes.
When rules meet graduation criteria (confidence ≥ 0.75, minimum 3 evidence episodes, minimum 7 days active), the evolution skill proposes concrete edits to skill files. This applies the principle from SICA (arXiv:2504.15228, ICLR 2025 SSI-FM Workshop), a self-improving coding agent that eliminates the distinction between meta-agent and target agent, editing its own codebase to improve itself. ShopForge applies this to skill files rather than source code: high-confidence rules can be promoted directly into the Markdown instructions that define the agent’s behavior, permanently changing how it operates.
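The graduation criteria form a simple predicate. The thresholds are from the text; the rule shape is an assumption:

```typescript
interface RuleStatus {
  confidence: number;    // Bayesian evidence-linked score, 0-1
  evidenceCount: number; // linked supporting episodes
  daysActive: number;    // time since the rule was created
}

// All three criteria must hold before the evolution skill proposes
// promoting a rule into a skill file.
function eligibleForGraduation(r: RuleStatus): boolean {
  return r.confidence >= 0.75 && r.evidenceCount >= 3 && r.daysActive >= 7;
}
```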
Safety validation checks for unsafe patterns: auto-publish instructions, in-memory caching that bypasses persistence, or dead MCP references from the legacy system. Low-risk changes can auto-apply; structural or safety-critical changes require explicit user approval. Every graduation is logged with a diff and backed by a file backup, making rollbacks trivial.
After 10 days and 68 episodes: zero rules have graduated. Average confidence across all 27 evaluated rules is 0.42, well below the 0.75 threshold. The 15 active rules are still accumulating evidence. The system is in the learning phase, not the automation phase. This is the architecture working as designed.
Quality Assurance: Three Tiers, 17 Gates
The QA pipeline has the most direct lineage to real production failures. Every gate exists because something went wrong.
Tier 1: Technical Validation (Automated, No LLM)
Resolution, DPI, format, file size, all checked against configurable thresholds. Prompts are linted against denylists. This tier is fast and deterministic.
Tier 2: Visual Inspection (Claude’s Assessment)
Seven-dimension scoring with calibrated weights. But the most valuable check in this tier is the corner crop analysis: the system extracts four 800×800 crops from each image corner and evaluates them independently for AI signatures and watermarks. This catches artifacts that full-image review misses approximately 30% of the time. AI-generated images frequently embed subtle signatures in corners that are invisible at full resolution but obvious when cropped and examined.
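Computing the four crop regions is pure geometry; a library like sharp could then consume each region via its `extract` option. The region shape below mirrors sharp’s `{ left, top, width, height }` convention, but the helper itself is illustrative:

```typescript
interface Region { left: number; top: number; width: number; height: number }

// Four corner regions of up to `size`×`size` pixels, clamped for small images.
function cornerCrops(imgWidth: number, imgHeight: number, size = 800): Region[] {
  const w = Math.min(size, imgWidth);
  const h = Math.min(size, imgHeight);
  return [
    { left: 0, top: 0, width: w, height: h },                        // top-left
    { left: imgWidth - w, top: 0, width: w, height: h },             // top-right
    { left: 0, top: imgHeight - h, width: w, height: h },            // bottom-left
    { left: imgWidth - w, top: imgHeight - h, width: w, height: h }, // bottom-right
  ];
}
```

Each crop is then evaluated independently, so a faint watermark in one corner can fail the image even when the full-frame review looks clean.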
The listing photo QA sub-tier was born from a specific production failure. On February 24, 2026, I was reviewing published listings on the live Etsy shop and noticed problems that the existing QA had completely missed:
> I’m noticing some issues in our listing photos:
> 1. Vintage French Countryside has images that look too plain, with a lot of white space.
> 2. Chateaucore French Estate has a sizing image with images covering the text.
> 3. The ‘What’s included’ image has the ‘INSTANT DOWNLOAD’ text being cut off.
These weren’t subtle issues; they were visible to any buyer browsing the shop. Claude’s root cause analysis revealed that the QA pipeline at that point was technical-only: it checked resolution and file size but had zero blocker checks for insufficient staging, text obstruction, or text clipping. Six new blocker rules were added to the QA rubric in that session.
During the fix, I caught another failure in real time: “I notice the new size guide is being cut off on the right side now”, revealing a width calculation bug in the compositor. Each failure was hardened into the skill system, where it prevents regression permanently.
This is the pattern that repeats throughout ShopForge’s development: a real failure in production → root cause analysis → a new gate or blocker rule → encoded into a skill file → never happens again.
Additional Tier 2 gates include:
- GREEN_BLEED_BLOCKER: Detects green-screen compositing artifacts using three simultaneous color detection methods (RGB ratio, HSV, and relative green excess). Any single method misses edge cases; the combined approach catches artifacts at a 0.05% threshold.
- HERO_DIVERSITY_BLOCKER: Enforces round-robin rotation across 5 frame/room variations with minimum-separation constraints, so the shop grid never looks templated.
- FRAME_ASPECT_MISMATCH: Rejects composited photos where the frame orientation doesn’t match the source art aspect ratio by more than 15%.
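A per-pixel sketch of the three simultaneous detection methods behind GREEN_BLEED_BLOCKER. The method names come from the source; the specific numeric cutoffs here are assumptions, since the text only fixes the 0.05% image-level threshold:

```typescript
// Flag a pixel as green bleed if ANY of the three methods fires —
// each single method misses edge cases on its own.
function isGreenBleedPixel(r: number, g: number, b: number): boolean {
  // Method 1: RGB ratio — green clearly dominates both other channels.
  const ratioHit = g > 60 && g > r * 1.3 && g > b * 1.3;

  // Method 2: HSV — hue lands in the green band with real saturation.
  const max = Math.max(r, g, b);
  const min = Math.min(r, g, b);
  const sat = max === 0 ? 0 : (max - min) / max;
  let hue = 0;
  if (max !== min && max === g) hue = 60 * (2 + (b - r) / (max - min));
  const hsvHit = max === g && hue >= 80 && hue <= 160 && sat > 0.3;

  // Method 3: relative green excess over the mean of the other channels.
  const excessHit = g > r && g > b && g - (r + b) / 2 > 40;

  return ratioHit || hsvHit || excessHit;
}
```

An image fails the gate when the fraction of flagged pixels along composited edges exceeds the 0.05% threshold.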
Tier 3: Persona-Based Buyer Evaluation
Three buyer personas (Sarah the Intentional Decorator, Emma the Gift-Giver, Rachel the Etsy Power Shopper) evaluate listings across 8 dimensions: purchase intent, cohesion, style match, framability, overall appeal, click appeal, description quality, and price perception. Verdicts are scored: strong (≥ 4.0), acceptable (≥ 3.0), weak (< 3.0). The persona-based approach draws from PersonaCite’s work on VoC-grounded synthetic AI personas (arXiv:2601.22288), adapted from user research methodology to e-commerce quality evaluation.
QA depth adapts to confidence. High-confidence contexts (rule confidence ≥ 0.85) trigger light QA with 1 persona. Medium confidence (≥ 0.60) triggers standard QA with 2 personas. Low confidence (< 0.60) triggers thorough QA with all personas. The system expends more evaluation effort where it’s less certain, a direct application of the Bayesian confidence framework to resource allocation.
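The confidence-to-depth mapping is a small lookup (thresholds and persona counts from the text; names illustrative):

```typescript
type QaDepth = { mode: "light" | "standard" | "thorough"; personas: number };

// Spend more evaluation effort where the system is less certain.
function qaDepthFor(confidence: number): QaDepth {
  if (confidence >= 0.85) return { mode: "light", personas: 1 };
  if (confidence >= 0.6) return { mode: "standard", personas: 2 };
  return { mode: "thorough", personas: 3 };
}
```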
Persona calibration tracking exists but has no data yet: zero sales means zero predicted-vs-actual comparisons. This is the feedback loop that will close as the shop matures.
The First Full Pipeline Run
On March 4, 2026, the Spring Botanical Herbarium Illustration 20-Pack became the first product to complete the full ShopForge pipeline end-to-end.
The QA score was 4.76/5.0 with 100% generation success and zero rejections. On paper, a clean run. In practice, it surfaced four problems the system hadn’t encountered before:
Resolution gap. Images were generated at 2K instead of 4K. Gemini’s imageSize parameter wasn’t being passed correctly through the generation function. The images looked fine at screen resolution but would print poorly. This became a hard gate: LISTING_PHOTO_RESOLUTION now rejects any image with a shortest side under 2,000 pixels.
Staging explosion. The generation process created 85 separate staging folders, one per API call. Functionally correct, operationally a mess. The staging cleanup system now consolidates per-product.
Green fringe artifacts. Composited listing photos showed green fringe at the edges where the art met the frame mockup. A single color detection method wasn’t sufficient. The fix required a 4-pass defringe pipeline using three simultaneous detection methods (RGB ratio, HSV analysis, relative green excess). This became the GREEN_BLEED_BLOCKER gate.
Video iteration. The listing video went through 4 iterations to resolve jitter, bounce, and zoom issues. Each iteration’s failures were recorded as episodes, and the video generation skill now carries the constraints learned from those failures.
The pipeline took approximately 16 hours wall-clock time, exhausted the context window 3–4 times, and produced a publishable listing. Every issue discovered was documented back into skill files the same day. There was no celebration moment. The goal was always a system that catches its own problems, not one that works perfectly on the first try.
Trust Engineering: Why “Never Auto-Publish” Appears Eight Times
The most frequently asked question about ShopForge is some variant of “so it just publishes listings automatically?” No. The system’s #1 unbreakable rule is: never auto-publish to Etsy. Draft only. The user publishes manually.
This rule appears in 8 separate skill files. The redundancy is intentional. Every skill that could plausibly reach the publishing step re-states the constraint, because skills are loaded independently and an agent operating with only one skill in context should still respect the boundary.
The PUBLISH_GATE is a compound gate that requires:
- AI disclosure present in 3 mandatory locations (Etsy policy compliance)
- All technical specifications validated
- All images passing visual QA including corner crop checks
- User explicitly approved
Seventeen gates total cover image-level, pack-level, listing-level, and system-level decisions. Each gate has a defined blocker threshold and a documented failure response. The evolution scanner validates proposed rule graduations against unsafe patterns; any rule that would introduce auto-publishing, in-memory caching that bypasses persistence, or references to dead MCP tools from the legacy system is rejected automatically.
Graduated Autonomy
Decision gates track their own approval history. A gate earns auto-approval when its historical approval rate exceeds 90% across at least 5 precedents. Three tiers exist: auto_approved (earned trust), recommending (strong track record, user confirms), and full_review (always requires human evaluation).
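The tier assignment can be sketched as a function of a gate’s approval history. The auto-approval thresholds (>90%, ≥5 precedents) and tier names are from the text; the 75% cutoff for `recommending` is an assumption:

```typescript
type AutonomyTier = "auto_approved" | "recommending" | "full_review";

function autonomyTier(approved: number, total: number): AutonomyTier {
  if (total >= 5 && approved / total > 0.9) return "auto_approved";
  if (total >= 5 && approved / total > 0.75) return "recommending"; // cutoff assumed
  return "full_review"; // default: every decision goes to the human
}
```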
After 10 days: zero gates have reached auto-approval status. The system is in full_review for every decision. This is the correct behavior for a system with limited operational history; trust is earned through demonstrated competence, not configured by default.
Honest Assessment: What’s Working and What Isn’t
Working
The memory system is accumulating real signal. 68 episodes across 8 types. 15 active semantic rules with evidence-linked confidence scores. The Bayesian scoring produces meaningful differentiation: a 0.82-confidence operational rule behaves differently in the system than a 0.33-confidence stylistic hypothesis.
Rules are learning from real failures. The consolidation API bug → learned rule about field naming. The listing photo QA gaps → 6 new blocker rules. The size-claim overstatement → verification rule (“3392/300 = 11.3 inches, not 18”; listing copy must be verified against actual pixel dimensions). The Pinterest FK constraint → queue ordering rule. Each of these was a real production failure that produced a real, persistent improvement.
The architecture handles session interruptions gracefully. Context window exhaustions, VM crashes, and network drops have all occurred during the 10-day period. Atomic state writes and session-state persistence mean the system resumes cleanly every time.
Confidence is climbing where it should. Image generation scope rules rose from 0.33–0.60 to 0.74 average confidence. Operational rules reached 0.99. The Bayesian scoring is producing the convergence curves the research predicts.
Not Working Yet
No revenue. 17 listings, zero sales. The shop is 10 days old with $1/day in Etsy Ads, so this is expected but not demonstrated.
QA scoring lacks structured data. Visual QA verdicts were recorded as PASS/FAIL strings, not numeric scores. The improvement curve functions exist but return empty results because there’s no numeric data to trend. This is a recording gap, not an architecture gap, fixable by updating the episode recording to capture dimension scores.
Persona calibration is theoretical. The framework tracks predicted buyer scores against actual market performance, but with zero sales there are zero data points. The calibration loop is built but untested.
The evolution loop hasn’t fired. No rules have been promoted into skill files. Average confidence across all evaluated rules is 0.42, well below the 0.75 graduation threshold. This is the architecture working as designed (the system shouldn’t graduate rules prematurely), but it means the outer self-improvement loop is unproven in production.
No cost tracking or circuit breakers. Every Gemini API call hits the API directly with no caching and no spending alerts. For a system designed to run autonomously, this is a gap that matters as scale increases.
What’s Transferable
ShopForge is pointed at an Etsy shop, but the architecture is domain-agnostic. To make this concrete: consider an AI agent that reviews pull requests for a development team. The three-tier memory maps directly: episodes capture every review (what was flagged, what the developer accepted or rejected), semantic rules distill patterns (“functions over 50 lines in this codebase are 3× more likely to need refactoring”), and templates compile proven review checklists per file type. The Bayesian confidence framework means the agent starts cautious and earns credibility through demonstrated accuracy. Graduated autonomy means it begins by suggesting and eventually auto-approves low-risk reviews, but only after its approval rate proves it out.
The same pattern applies to any repetitive, quality-sensitive process: content moderation, data pipeline validation, customer support triage, compliance checking. The core requirements are always the same:
- Learn from operational history without fine-tuning a model
- Earn trust incrementally through demonstrated competence
- Improve its own procedures based on evidence, with human oversight on structural changes
- Handle session discontinuity in environments where context windows expire and processes restart
The research papers cited throughout this case study provide the theoretical foundation. ShopForge provides a production implementation that tests those theories against real constraints: unreliable infrastructure, limited context windows, cold start problems, and the fundamental challenge of building systems that improve themselves without losing the trust of the humans who operate them.
ShopForge is an active project. The project page has the current technical specification. The ChateaucoreWalls Etsy shop is the live production environment. This case study will be updated as the system matures and the feedback loops close.
Research Citations
| Paper | How It Informed ShopForge |
|---|---|
| A-MEM (arXiv:2502.12110, Feb 2025) | Autonomous memory organization principles; inspired the three-tier hierarchy |
| Memory in the Age of AI Agents (arXiv:2512.13564, Dec 2025) | Progressive abstraction taxonomy; consolidation as a first-class operation |
| Reflexion (arXiv:2303.11366, Shinn et al., NeurIPS 2023) | Inner loop: per-task self-critique and retry with verbal reinforcement |
| Self-Evolving Agents Survey + EvoAgentX (arXiv:2508.07407) | Middle loop: framework for classifying and implementing adaptive consolidation of experiences |
| Metacognitive Learning (arXiv:2506.05109, Liu & van der Schaar, ICML 2025 Position Paper) | Outer loop: metacognitive knowledge, planning, and evaluation as three components of genuine self-improvement |
| SICA (arXiv:2504.15228, Robeyns et al., ICLR 2025 SSI-FM Workshop) | Skill evolution: agents modifying their own capabilities through self-modification |
| PersonaCite (arXiv:2601.22288, Truss, Jan 2026) | Inspired persona-based evaluation; adapted from VoC-grounded user research to e-commerce QA |
| Nakajima (2025) | Trajectory-to-exemplar pattern for template compilation |
| Microsoft ISE Multi-Agent Patterns | Planner-Worker-Judge architecture for sub-agent coordination |