AI Research Pipeline | Bryce Watson

Status: Released, Phases 1–3 complete (CLI MVP, multi-agent orchestration, quality and evaluation with model selection). View on GitHub.

A multi-agent retrieval-augmented research assistant that combines cloud LLM reasoning, live research APIs, and persistent vector memory with intelligent model selection and factual-consistency scoring.

The Problem

Research workflows that lean on a single LLM run into three recurring failures. The model picks a planning strategy that’s wrong for the complexity of the query. It synthesizes from its own pre-training instead of live sources. And it hallucinates citations or stitches contradictory sources together without flagging the conflict. A general-purpose chat interface gives you no way to harden any of those steps.

The Approach

A three-agent workflow with model selection scoped to each agent’s task:

Seed Agent decomposes the query and plans the search. Runs on o1-mini because the work is structural planning rather than synthesis.
Sourcing Agent calls research APIs, then filters and evaluates the returned content. Runs on sonar-pro (Perplexity) for live web research.
Research Agent does retrieval, synthesis, and conflict detection across local Chroma vectors and live results. Runs on o1 for the heavier reasoning.

A model selection layer routes each call by task and context size: smaller models for planning, larger models for synthesis, and a high-context fallback (gpt-4.1) when the working set crosses ~100k tokens. Fallback chains catch model errors and move on to the next model, so a single failure does not drop the run.
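
The routing rule and fallback chain described above could look roughly like the sketch below. It is a minimal illustration, not the project's code: the task labels, the function names build_chain and call_with_fallback, and the choice of gpt-4.1 as the second link in every chain are assumptions. Only the model names and the ~100k token threshold come from the text.

```python
from typing import Callable

# Assumption: mirrors the "~100k tokens" figure quoted in the text.
HIGH_CONTEXT_THRESHOLD = 100_000
HIGH_CONTEXT_MODEL = "gpt-4.1"

# Hypothetical task labels mapped to the models named in the text.
PRIMARY_MODEL = {
    "plan": "o1-mini",       # Seed Agent
    "source": "sonar-pro",   # Sourcing Agent
    "synthesize": "o1",      # Research Agent
}


def build_chain(task: str, context_tokens: int) -> list[str]:
    """Return the models to try, in order, for one agent call."""
    if context_tokens > HIGH_CONTEXT_THRESHOLD:
        # Working set is too large for the primary model: go straight
        # to the high-context model.
        return [HIGH_CONTEXT_MODEL]
    # Assumption: the high-context model doubles as the error fallback.
    return [PRIMARY_MODEL[task], HIGH_CONTEXT_MODEL]


def call_with_fallback(chain: list[str], call_model: Callable[[str], str]):
    """Try each model in turn; return (model, output) from the first success."""
    errors = []
    for model in chain:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def flaky_client(model: str) -> str:
    """Stand-in for a real API call; pretends o1 is timing out."""
    if model == "o1":
        raise TimeoutError("simulated timeout")
    return f"answer from {model}"
```

With the stand-in client, call_with_fallback(build_chain("synthesize", 5_000), flaky_client) skips the failing o1 call and returns the gpt-4.1 result instead of aborting the run.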

What’s Implemented

Three-agent orchestration: Seed → Sourcing → Research, with structured handoffs and per-agent prompt templates
Local vector store: Chroma DB with persistent memory across sessions; python app/cli.py add ingests text, titles, and URLs
Live research: Perplexity Sonar integration for real-time web sources, mixed with local citations and conflict detection
Vectara factual consistency scoring (FCS): runs on synthesized responses; combined with model confidence and citation quality into a multi-factor confidence score
Performance monitoring: real-time, exportable metrics on model usage, success rates, and per-model comparison
CLI surface: ask, add, stats, report, models, performance, select-model
Test suite: basic end-to-end pipeline test, smoke tests, and model-selection tests under tests/
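
As an illustration of the multi-factor confidence score mentioned in the list above, one simple way to combine the three signals is a weighted average. The sketch below is an assumption throughout: the weights, the ConfidenceInputs type, and the requirement that every signal lies in [0, 1] are placeholders, and the project may combine the factors differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceInputs:
    fcs: float               # factual consistency score, assumed in [0, 1]
    model_confidence: float  # model's self-reported confidence, assumed in [0, 1]
    citation_quality: float  # hypothetical citation score, assumed in [0, 1]


# Hypothetical weights; not taken from the project.
DEFAULT_WEIGHTS = {"fcs": 0.5, "model_confidence": 0.2, "citation_quality": 0.3}


def multi_factor_confidence(inputs: ConfidenceInputs, weights=None) -> float:
    """Weighted average of the three signals, normalized by the weight total."""
    weights = weights or DEFAULT_WEIGHTS
    values = {
        "fcs": inputs.fcs,
        "model_confidence": inputs.model_confidence,
        "citation_quality": inputs.citation_quality,
    }
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = sum(weights.values())
    return sum(weights[name] * values[name] for name in weights) / total
```

Weighting the consistency score most heavily reflects one plausible design choice: a response that contradicts its sources should score low even when the model sounds confident.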

Architecture

Query → Seed Agent (o1-mini)         # plan search
      → Sourcing Agent (sonar-pro)   # fetch + evaluate live sources
      → Research Agent (o1)          # synthesize against Chroma + live
      → Vectara FCS                  # score factual consistency
      → Response + multi-factor confidence
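
The flow in the diagram can be sketched as a chain of typed handoffs. This is a hedged outline rather than the project's implementation: the dataclass names, the run_pipeline function, and the stub agents are all hypothetical, and the stubs return canned values where the real agents would call o1-mini, sonar-pro, o1, and Vectara.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SearchPlan:            # Seed Agent output (hypothetical shape)
    query: str
    sub_questions: list[str]


@dataclass
class Evidence:              # Sourcing Agent output (hypothetical shape)
    plan: SearchPlan
    sources: list[str]


@dataclass
class Answer:                # Research Agent output (hypothetical shape)
    text: str
    citations: list[str]
    fcs: Optional[float] = None


def run_pipeline(
    query: str,
    seed: Callable[[str], SearchPlan],
    sourcing: Callable[[SearchPlan], Evidence],
    research: Callable[[Evidence], Answer],
    score: Callable[[Answer, Evidence], float],
) -> Answer:
    """Run Seed -> Sourcing -> Research, then attach a consistency score."""
    plan = seed(query)
    evidence = sourcing(plan)
    answer = research(evidence)
    answer.fcs = score(answer, evidence)
    return answer


# Stub agents so the sketch runs without any API keys.
def stub_seed(query: str) -> SearchPlan:
    return SearchPlan(query, [f"What is known about {query}?"])


def stub_sourcing(plan: SearchPlan) -> Evidence:
    urls = [f"https://example.com/{i}" for i, _ in enumerate(plan.sub_questions)]
    return Evidence(plan, urls)


def stub_research(evidence: Evidence) -> Answer:
    return Answer(f"Summary for: {evidence.plan.query}", list(evidence.sources))


def stub_score(answer: Answer, evidence: Evidence) -> float:
    return 1.0 if answer.citations else 0.0
```

Passing the agents in as callables keeps each stage swappable, which fits the per-agent model selection described earlier.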

Configuration lives in model_config.yaml. The default chain pairs OpenAI’s reasoning models with sonar-pro for research and falls back to gpt-4.1 for very large contexts. Embeddings use text-embedding-3-large; reranking uses Cohere’s rerank-english-v3.0. An alternate LLM (Claude 3.5 Sonnet) can be plugged in when a second perspective on the same query is useful.

What This Demonstrates

Multi-agent orchestration with per-agent model selection grounded in actual cost/quality tradeoffs, not “one model for everything”
Hybrid RAG with both persistent local vectors and live web research, with explicit conflict surfacing rather than silent merging
Quality engineering for LLM outputs: factual consistency scoring, multi-factor confidence, and performance instrumentation built into the pipeline rather than bolted on afterward