author: gemini-cli
aliases:
title: Spec: Agentic Source Orchestrator
status: active
date: 2026-04-29
type: spec

Spec: Agentic Source Orchestrator

Purpose: Define the unified multi-agent orchestration layer for the spec-firecrawl-pgvector-pipeline. This spec establishes the "Knowledge Compiler" protocol, transforming raw web ingestion into a hardened, auditable, and epistemically verified knowledge graph within the YANP framework.


1. The Multi-Agent Trinity (Roles)

Execution is distributed across the fleet to enforce the Two-Role Invariant: *No single agent may both ingest source material and promote it to a permanent note.*

  • Gemini (Librarian): Gap Detection & Discovery. Identifies vault gaps, maps source candidates, manages the Index and Log. Primary metrics: Gap-to-Crawl Ratio, Index Accuracy.
  • Codex (Engineer): Control & Infrastructure. Manages the state machine, SQL schemas, MCP Tool implementation, and policy enforcement. Primary metrics: System Uptime, Policy Compliance, Audit Integrity.
  • Claude (Chronicler): Synthesis & Intelligence. Distills claims from chunks, resolves conflicts, and ensures epistemic quality. Primary metrics: Evidence Coverage, Claim-to-Evidence Ratio.
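The Two-Role Invariant can be enforced mechanically at promotion time. A minimal sketch, assuming each source record tracks which agent ingested it (the function and parameter names are illustrative, not part of the spec):

```python
def check_two_role_invariant(ingested_by: str, promoted_by: str) -> None:
    """Reject a promotion when the same agent both ingested the source
    material and is attempting to promote it to a permanent note."""
    if ingested_by == promoted_by:
        raise PermissionError(
            f"two-role invariant violated: {ingested_by} "
            "both ingested and promoted"
        )
```

In practice this check would run inside the promotion gate, so a violation blocks the merge rather than merely logging it.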

2. The Source Intake Lifecycle (State Machine)

The orchestrator manages a strict 8-stage lifecycle. No stage may be bypassed.

  1. Proposed: Agent registers a source request based on a vault gap or user directive.
  2. Mapped: Firecrawl /map identifies the URL graph and cost estimate.
  3. Approved: Human or policy gate grants permission to crawl.
  4. Crawled: Raw Markdown is fetched and stored in the sidecar.
  5. Indexed: Content is chunked, embedded, and stamped with provenance.
  6. Verified: Integrity checks (coherence, metadata, sample retrieval) pass.
  7. Synthesized: A draft note is created with claim-level evidence citations.
  8. Promoted: The draft is reviewed and merged as a Permanent Note in 01_Wiki/.
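The "no stage may be bypassed" rule reduces to allowing only single-step forward transitions. A minimal sketch of that guard, using the eight stage names above (the enum and function names are illustrative):

```python
from enum import IntEnum

class Stage(IntEnum):
    # The eight lifecycle stages, in strict order.
    PROPOSED = 1
    MAPPED = 2
    APPROVED = 3
    CRAWLED = 4
    INDEXED = 5
    VERIFIED = 6
    SYNTHESIZED = 7
    PROMOTED = 8

def advance(current: Stage, target: Stage) -> Stage:
    """Permit only the next stage in sequence; anything else is a bypass."""
    if target != current + 1:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

A real implementation would also persist the transition to the audit trail before returning.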

3. Epistemic Risk & Quality Gates

The system treats knowledge as Compiled Output. Claims must pass the following risk classification before promotion:

  • T0 (Fabrication): No supporting evidence or inference marker. Action: reject immediately.
  • T1 (Weak Evidence): Tangential or low-similarity chunks. Action: flag for human arbitration.
  • T2 (Unmarked Inference): Valid conclusion but missing "Derived" label. Action: annotate and proceed.
  • T3 (Stale Evidence): Evidence retrieved more than 90 days ago. Action: stamp as stale; queue re-ingestion.
  • T4 (Conflict): Contradicts existing permanent notes. Action: trigger Conflict Resolution (§5).
  • T5 (Verified): Direct support from high-similarity, fresh chunks. Action: eligible for promotion.
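The tiers above can be read as a routing function over a claim's evidence features. A sketch, with thresholds borrowed from §6 (0.78 similarity, 90-day freshness); the `Claim` shape and the precedence order of the checks are assumptions, not mandated by the spec:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    has_evidence: bool    # any supporting chunk or inference marker
    conflicts: bool       # contradicts an existing permanent note
    similarity: float     # best supporting-chunk similarity score
    is_inference: bool    # conclusion derived rather than directly stated
    marked_derived: bool  # carries an explicit "Derived" label
    age_days: int         # days since evidence retrieval

def classify(claim: Claim, min_sim: float = 0.78, max_age: int = 90) -> str:
    """Route a claim to its risk tier (T0 worst, T5 best)."""
    if not claim.has_evidence:
        return "T0"   # fabrication: reject immediately
    if claim.conflicts:
        return "T4"   # conflict: resolution path, section 5
    if claim.similarity < min_sim:
        return "T1"   # weak evidence: human arbitration
    if claim.is_inference and not claim.marked_derived:
        return "T2"   # unmarked inference: annotate and proceed
    if claim.age_days > max_age:
        return "T3"   # stale: stamp and queue re-ingestion
    return "T5"       # verified: eligible for promotion
```

Only T5 claims pass the gate unassisted; every other tier produces a side effect (rejection, flag, annotation, or queue entry).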

4. The MCP Tool Surface (The Reflex)

The orchestrator is exposed via a unified MCP server, ensuring all agents share the same operational interface.

  • propose_source_intake: Register a request; validates against denied_domains.
  • orchestrate_ingestion: Generates the map and cost estimate.
  • execute_source_crawl: Performs the crawl; enforces the credit quota.
  • index_crawled_source: Chunks and embeds using the Heading-Aware strategy.
  • verify_source_index: Runs integrity and provenance audits.
  • semantic_search_sources: Retrieves attributed chunks with metadata.
  • promote_synthesis_candidate: Compiles the "Promotion Packet" for review.
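The shared tool surface can be pictured as a single dispatch table keyed by tool name. The sketch below uses a plain registry rather than the actual MCP SDK, and the handler body shows only the `denied_domains` validation mentioned for `propose_source_intake`; everything beyond the tool names themselves is illustrative:

```python
from fnmatch import fnmatch
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

def tool(name: str):
    """Register a handler under an MCP tool name (simplified registry)."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("propose_source_intake")
def propose_source_intake(url: str, denied_domains: list[str]) -> dict:
    # Reject any URL whose host matches a denied_domains glob pattern.
    host = url.split("/")[2] if "//" in url else url
    if any(fnmatch(host, pattern) for pattern in denied_domains):
        raise PermissionError(f"domain denied by policy: {host}")
    return {"status": "Proposed", "url": url}
```

Because every agent calls through the same registry, policy checks like this one run identically no matter which agent issues the request.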

5. Conflict Resolution & Arbitration

When an incoming claim contradicts the existing graph, the agent must choose a resolution path:

  1. Update: New evidence is fresher/more authoritative. Supersede old claim.
  2. Narrow: Existing claim is too general. Add scope qualifiers.
  3. Parallelize: Both are true under different conditions (e.g., versions). Create conditioned notes.
  4. Escalate: High-stakes contradiction. Trigger AUTH_REQUIRED for human review.
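The four paths can be selected with a short precedence ladder; the ordering below (escalation first, update last) is an assumption about priority, not something the spec fixes:

```python
def choose_resolution(new_fresher: bool, old_too_general: bool,
                      both_hold_conditionally: bool, high_stakes: bool) -> str:
    """Pick a resolution path for a claim that contradicts the graph."""
    if high_stakes:
        return "Escalate"      # AUTH_REQUIRED: human review
    if both_hold_conditionally:
        return "Parallelize"   # conditioned notes, e.g. per version
    if old_too_general:
        return "Narrow"        # add scope qualifiers to the old claim
    if new_fresher:
        return "Update"        # supersede the old claim
    return "Escalate"          # no clear winner: default to human review
```

Defaulting to escalation when no condition matches keeps ambiguous contradictions out of the permanent graph.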

6. Policy Framework (pipeline-policy.yaml)

version: 1.0
quotas:
  max_credits_per_session: 100
  max_pages_per_source: 200
safety:
  denied_domains: ["*.social", "reddit.com"]
  require_hitl_for_costs_over: 20
  require_hitl_for_new_domain: true
synthesis:
  freshness_threshold_days: 90
  min_similarity_threshold: 0.78
  require_provenance_block: true
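Once parsed, the policy reduces to a handful of gate checks. A minimal sketch of the human-in-the-loop gate, with defaults taken from the YAML above (the `Policy` class and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Policy:
    # Mirrors pipeline-policy.yaml; defaults match the spec's figures.
    max_credits_per_session: int = 100
    max_pages_per_source: int = 200
    require_hitl_for_costs_over: int = 20
    require_hitl_for_new_domain: bool = True

def needs_hitl(policy: Policy, estimated_cost: int, domain_is_new: bool) -> bool:
    """True when a crawl must pause for human-in-the-loop approval."""
    return (estimated_cost > policy.require_hitl_for_costs_over
            or (domain_is_new and policy.require_hitl_for_new_domain))
```

The cost estimate comes from the Mapped stage, so this gate naturally sits between Mapped and Approved in the lifecycle.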

7. Provenance & Audit Trail

Every permanent note promoted via this orchestrator MUST contain YAML provenance fields in its frontmatter:

provenance_source_ids: ["sr_123"]
provenance_chunk_ids: ["chk_456", "chk_457"]
provenance_retrieved: "2026-04-29T12:00:00Z"
provenance_agent: "claude-chronicler"

This creates a stable link between the Wiki (Permanent Knowledge) and the Sidecar (Evidence Chunks).
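A promotion gate can verify both the presence of these fields and the 90-day freshness window from §6 in one pass. A sketch, assuming frontmatter is already parsed into a dict (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

REQUIRED = ("provenance_source_ids", "provenance_chunk_ids",
            "provenance_retrieved", "provenance_agent")

def check_provenance(frontmatter: dict, freshness_days: int = 90) -> bool:
    """Raise if mandatory provenance fields are missing; return whether
    the evidence still falls inside the freshness window."""
    if any(key not in frontmatter for key in REQUIRED):
        raise ValueError("missing provenance field(s)")
    retrieved = datetime.fromisoformat(
        frontmatter["provenance_retrieved"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - retrieved <= timedelta(days=freshness_days)
```

A `False` return would map to tier T3 (stale evidence) rather than a hard failure: the note is stamped and the source queued for re-ingestion.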
