Spec: Agentic Source Orchestrator
Purpose: Define the unified multi-agent orchestration layer for the spec-firecrawl-pgvector-pipeline. This spec establishes the "Knowledge Compiler" protocol, transforming raw web ingestion into a hardened, auditable, and epistemically verified knowledge graph within the YANP framework.
1. The Multi-Agent Trinity (Roles)
Execution is distributed across the fleet to enforce the Two-Role Invariant: *No single agent may both ingest source material and promote it to a permanent note.*
| Agent | Responsibility | Primary Metrics |
|---|---|---|
| Gemini (Librarian) | Gap Detection & Discovery: Identifies vault gaps, maps source candidates, manages the Index and Log. | Gap-to-Crawl Ratio, Index Accuracy. |
| Codex (Engineer) | Control & Infrastructure: Manages the state machine, SQL schemas, MCP Tool implementation, and policy enforcement. | System Uptime, Policy Compliance, Audit Integrity. |
| Claude (Chronicler) | Synthesis & Intelligence: Distills claims from chunks, resolves conflicts, and ensures epistemic quality. | Evidence Coverage, Claim-to-Evidence Ratio. |
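The Two-Role Invariant can be checked mechanically from the audit trail before any promotion. A minimal sketch, assuming an audit log kept as a list of dicts with `agent`, `source_id`, and `stage` fields (hypothetical names, not defined by this spec):

```python
# Illustrative Two-Role Invariant guard; the audit-log shape is an assumption.
INGEST_STAGES = {"Crawled", "Indexed"}

def check_two_role_invariant(audit_log: list[dict], promoting_agent: str, source_id: str) -> None:
    """Raise if the promoting agent also performed an ingestion stage for this source."""
    ingested_by = {
        entry["agent"]
        for entry in audit_log
        if entry["source_id"] == source_id and entry["stage"] in INGEST_STAGES
    }
    if promoting_agent in ingested_by:
        raise PermissionError(
            f"{promoting_agent} ingested {source_id} and therefore cannot promote it"
        )
```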
2. The Source Intake Lifecycle (State Machine)
The orchestrator manages a strict 8-stage lifecycle. No stage may be bypassed.
- Proposed: Agent registers a source request based on a vault gap or user directive.
- Mapped: Firecrawl `/map` identifies the URL graph and cost estimate.
- Approved: Human or policy gate grants permission to crawl.
- Crawled: Raw Markdown is fetched and stored in the sidecar.
- Indexed: Content is chunked, embedded, and stamped with provenance.
- Verified: Integrity checks (coherence, metadata, sample retrieval) pass.
- Synthesized: A draft note is created with claim-level evidence citations.
- Promoted: The draft is reviewed and merged as a Permanent Note in `01_Wiki/`.
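The eight stages above can be encoded as an ordered transition table so the "no stage may be bypassed" rule is enforced in code rather than by convention. A minimal sketch (the `advance` helper is an assumption, not a tool defined by this spec):

```python
# Lifecycle stages in order; only the immediate successor is a legal transition.
LIFECYCLE = [
    "Proposed", "Mapped", "Approved", "Crawled",
    "Indexed", "Verified", "Synthesized", "Promoted",
]

def advance(current: str, target: str) -> str:
    """Move a source to the next stage, rejecting skips and backward moves."""
    i = LIFECYCLE.index(current)
    if i + 1 >= len(LIFECYCLE) or LIFECYCLE[i + 1] != target:
        raise ValueError(f"Illegal transition {current} -> {target}")
    return target
```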
3. Epistemic Risk & Quality Gates
The system treats knowledge as Compiled Output. Claims must pass the following risk classification before promotion:
| Tier | Classification | Action Required |
|---|---|---|
| T0 — Fabrication | No supporting evidence or inference marker. | Reject immediately. |
| T1 — Weak Evidence | Tangential or low-similarity chunks. | Flag for human arbitration. |
| T2 — Unmarked Inference | Valid conclusion but missing "Derived" label. | Annotate and proceed. |
| T3 — Stale Evidence | Evidence retrieved > 90 days ago. | Stamp as stale; queue re-ingestion. |
| T4 — Conflict | Contradicts existing permanent notes. | Trigger Conflict Resolution (§5). |
| T5 — Verified | Direct support from high-similarity, fresh chunks. | Eligible for promotion. |
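The tiers lend themselves to an ordered classifier over a claim's retrieval evidence. A minimal sketch, assuming per-chunk `similarity` and `retrieved_at` fields plus claim-level flags (all field names are assumptions; the thresholds mirror §6):

```python
from datetime import datetime, timedelta, timezone

def classify_claim(chunks: list[dict], is_inference: bool, marked_derived: bool,
                   conflicts: bool, min_similarity: float = 0.78,
                   freshness_days: int = 90) -> str:
    """Map a claim's evidence to a risk tier (T0..T5) per the table above."""
    if not chunks and not marked_derived:
        return "T0"   # Fabrication: reject immediately
    if conflicts:
        return "T4"   # Conflict: trigger Conflict Resolution (§5)
    strong = [c for c in chunks if c["similarity"] >= min_similarity]
    if not strong:
        return "T1"   # Weak evidence: flag for human arbitration
    if is_inference and not marked_derived:
        return "T2"   # Unmarked inference: annotate with a "Derived" label
    cutoff = datetime.now(timezone.utc) - timedelta(days=freshness_days)
    if all(c["retrieved_at"] < cutoff for c in strong):
        return "T3"   # Stale evidence: stamp and queue re-ingestion
    return "T5"       # Verified: eligible for promotion
```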
4. The MCP Tool Surface (The Reflex)
The orchestrator is exposed via a unified MCP server, ensuring all agents share the same operational interface.
- `propose_source_intake`: Registers a request; validates against `denied_domains`.
- `orchestrate_ingestion`: Generates the map and cost estimate.
- `execute_source_crawl`: Performs the crawl; enforces the credit quota.
- `index_crawled_source`: Chunks and embeds using the Heading-Aware strategy.
- `verify_source_index`: Runs integrity and provenance audits.
- `semantic_search_sources`: Retrieves attributed chunks with metadata.
- `promote_synthesis_candidate`: Compiles the "Promotion Packet" for review.
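A minimal registration sketch, assuming the Python MCP SDK's `FastMCP` helper; only two of the seven tools are shown, and the handler bodies are stubs rather than the pipeline's real implementation:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("source-orchestrator")

@mcp.tool()
def propose_source_intake(url: str, reason: str) -> dict:
    """Register a source request; a real handler would check denied_domains first."""
    return {"source_id": "sr_pending", "status": "Proposed", "url": url, "reason": reason}

@mcp.tool()
def semantic_search_sources(query: str, limit: int = 8) -> list[dict]:
    """Return attributed chunks; a real handler would run a pgvector similarity search."""
    return []

if __name__ == "__main__":
    mcp.run()
```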
5. Conflict Resolution & Arbitration
When an incoming claim contradicts the existing graph, the agent must choose a resolution path:
- Update: New evidence is fresher/more authoritative. Supersede old claim.
- Narrow: Existing claim is too general. Add scope qualifiers.
- Parallelize: Both are true under different conditions (e.g., versions). Create conditioned notes.
- Escalate: High-stakes contradiction. Trigger `AUTH_REQUIRED` for human review.
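A minimal path-selection sketch, assuming simple claim metadata such as `high_stakes`, `conditions`, `scope`, and `retrieved_at` (all hypothetical fields; the ordering of checks is one reasonable choice, not mandated by this spec):

```python
def resolve_conflict(new_claim: dict, existing_claim: dict) -> str:
    """Choose a resolution path for a contradicting claim, per the list above."""
    if new_claim.get("high_stakes") or existing_claim.get("high_stakes"):
        return "Escalate"     # AUTH_REQUIRED: route to human review
    if new_claim.get("conditions") != existing_claim.get("conditions"):
        return "Parallelize"  # both hold under different conditions (e.g., versions)
    if existing_claim.get("scope") == "general" and new_claim.get("scope") == "narrow":
        return "Narrow"       # add scope qualifiers to the existing note
    if new_claim["retrieved_at"] > existing_claim["retrieved_at"]:
        return "Update"       # fresher evidence supersedes the old claim
    return "Escalate"         # ambiguous cases default to human arbitration
```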
6. Policy Framework (pipeline-policy.yaml)
```yaml
version: 1.0
quotas:
  max_credits_per_session: 100
  max_pages_per_source: 200
safety:
  denied_domains: ["*.social", "reddit.com"]
  require_hitl_for_costs_over: 20
  require_hitl_for_new_domain: true
synthesis:
  freshness_threshold_days: 90
  min_similarity_threshold: 0.78
  require_provenance_block: true
```
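A minimal loader and gate-check sketch, assuming PyYAML; the helper names and the `known_domains` argument are assumptions, not part of the policy file format:

```python
import fnmatch
import yaml  # PyYAML

def load_policy(path: str = "pipeline-policy.yaml") -> dict:
    """Parse the policy file shown above."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)

def crawl_requires_hitl(policy: dict, domain: str, estimated_credits: int,
                        known_domains: set[str]) -> bool:
    """Apply the safety gates: denied domains abort; cost and new-domain gates need HITL."""
    safety = policy["safety"]
    if any(fnmatch.fnmatch(domain, pattern) for pattern in safety["denied_domains"]):
        raise PermissionError(f"{domain} is in denied_domains")
    return (estimated_credits > safety["require_hitl_for_costs_over"]
            or (safety["require_hitl_for_new_domain"] and domain not in known_domains))
```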
7. Provenance & Audit Trail
Every permanent note promoted via this orchestrator MUST contain YAML provenance fields in its frontmatter:
```yaml
provenance_source_ids: ["sr_123"]
provenance_chunk_ids: ["chk_456", "chk_457"]
provenance_retrieved: "2026-04-29T12:00:00Z"
provenance_agent: "claude-chronicler"
```
This creates a stable link between the Wiki (Permanent Knowledge) and the Sidecar (Evidence Chunks).
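A minimal frontmatter check sketch, assuming PyYAML and notes whose frontmatter is delimited by `---` lines; the required-field list mirrors the example above:

```python
import yaml  # PyYAML

REQUIRED_PROVENANCE = (
    "provenance_source_ids",
    "provenance_chunk_ids",
    "provenance_retrieved",
    "provenance_agent",
)

def missing_provenance(note_text: str) -> list[str]:
    """Return the provenance fields absent from a note's YAML frontmatter."""
    if not note_text.startswith("---"):
        return list(REQUIRED_PROVENANCE)
    frontmatter = yaml.safe_load(note_text.split("---", 2)[1]) or {}
    return [field for field in REQUIRED_PROVENANCE if field not in frontmatter]
```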
References
- protocol-source-ingestion-runbook — Operational procedures and multi-agent role sequence
- spec-firecrawl-pgvector-pipeline
- firecrawl-crawling-capabilities
- firecrawl-map-capabilities
- firecrawl-scrape-capabilities
- firecrawl-api-v2-reference
- the-compounding-artifact
- visitor-directives
- agent-note-conventions