Source Ingestion Runbook
This runbook captures the operational protocol for turning an external source into grounded vault knowledge. It is the practiced workflow behind the spec-agentic-source-orchestrator and spec-firecrawl-pgvector-pipeline.
Role Sequence
The ingestion loop is deliberately multi-agent:
- Gemini / Librarian prepares the intake packet: target URL, gap rationale, expected note, expected links, and policy/HITL assessment.
- Human / HITL approves bounded crawls when policy requires approval or when the source is foundational.
- Codex / Engineer runs the mechanical pipeline: dry run, live crawl, index, verify, and seam handoff. Codex does not synthesize or promote the permanent note in the same pass.
- Claude / Chronicler synthesizes verified chunks into a literature or permanent note with provenance.
- Gemini / Librarian performs final graph integration: MOC wiring, backlinks, index updates, log entry, and commit hygiene.
This preserves the Two-Role Invariant: no single agent both ingests source material and promotes it to permanent knowledge.
Preflight
Before running an external crawl, Codex should verify the local tool surface:
pwsh -NoProfile -ExecutionPolicy Bypass -File 02_System\check-mcp-health.ps1
Use live credential checks only when needed:
pwsh -NoProfile -ExecutionPolicy Bypass -File 02_System\check-mcp-health.ps1 -LiveServices
Use -LiveFirecrawl only for explicit Firecrawl credential validation, because it calls the external service.
Intake Packet
The Librarian intake should include:
- exact target URL
- reason this source closes a vault gap
- expected target note path and note type
- expected key wikilinks
- expected page count
- whether HITL is required or already granted
- explicit instruction not to broaden the crawl
For single-page documentation ingestion, prefer expected_pages: 1.
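The intake packet above can be sketched as a plain record with a completeness check. This is a minimal illustration only: the field names are assumptions for this sketch, not a schema defined by the pipeline.

```python
# Hypothetical intake packet shape; field names are illustrative,
# not a schema the orchestrator actually enforces.
intake_packet = {
    "target_url": "https://example.com/docs/page",   # exact target URL
    "gap_rationale": "Vault lacks a grounded note on this topic",
    "expected_note_path": "03_Wiki/example-topic.md",
    "expected_note_type": "literature",
    "expected_links": ["[[related-note-a]]", "[[related-note-b]]"],
    "expected_pages": 1,          # single-page ingestion: prefer 1
    "hitl_required": False,       # True when policy demands approval
    "do_not_broaden": True,       # explicit instruction: crawl only this URL
}

def is_complete(packet: dict) -> bool:
    """Check that every required intake field is present and non-empty."""
    required = [
        "target_url", "gap_rationale", "expected_note_path",
        "expected_note_type", "expected_links", "expected_pages",
        "hitl_required", "do_not_broaden",
    ]
    return all(k in packet and packet[k] not in (None, "", []) for k in required)
```

The Librarian fills this in; Codex should refuse a packet that fails the completeness check rather than guessing missing fields.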
Codex Execution Steps
Codex runs the pipeline in two phases.
1. Dry Run
Run:
- propose_source_intake
- orchestrate_ingestion with dry_run: true
- execute_source_crawl with dry_run: true
Proceed only if the dry run remains bounded: correct domain, expected page count, estimated credits, and HITL status.
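The "proceed only if bounded" gate can be written as one explicit predicate. The dry-run result fields below are assumptions about the tool's output shape, used only to make the four checks concrete:

```python
from urllib.parse import urlparse

def dry_run_is_bounded(result: dict, expected_domain: str,
                       expected_pages: int, credit_budget: int) -> bool:
    """Gate the live crawl on four conditions: correct domain,
    page count within expectation, estimated credits within budget,
    and HITL either not required or already granted.
    The keys on `result` are hypothetical, not a documented schema."""
    domain_ok = urlparse(result["target_url"]).netloc == expected_domain
    pages_ok = result["estimated_pages"] <= expected_pages
    credits_ok = result["estimated_credits"] <= credit_budget
    hitl_ok = (not result["hitl_required"]) or result["hitl_granted"]
    return domain_ok and pages_ok and credits_ok and hitl_ok
```

If any one check fails, the correct move is to return to the Librarian with the dry-run evidence, not to widen the crawl.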
2. Live Run
Run:
- orchestrate_ingestion with dry_run: false
- execute_source_crawl with dry_run: false
- index_crawled_source
- verify_source_index
Stop before synthesis unless verify_source_index passes with no blocking findings.
Required Evidence IDs
Every successful Codex handoff must preserve:
- original requested URL
- resolved indexed URL, especially if a redirect occurred
- Firecrawl crawl job ID
- Supabase source_pages.id
- chunk count
- all chunk IDs used later by the synthesizing agent
- verification status and findings
These IDs are the bridge between the sidecar evidence and the promoted vault note.
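The evidence bundle can be carried across the seam as a single record, so no ID is dropped between ingestion and synthesis. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class IngestionEvidence:
    """Evidence IDs a Codex handoff must preserve (illustrative shape)."""
    requested_url: str
    resolved_url: str            # differs from requested_url on redirect
    crawl_job_id: str            # Firecrawl crawl job ID
    source_page_id: int          # Supabase source_pages.id
    chunk_count: int
    chunk_ids: list[int] = field(default_factory=list)
    verification_status: str = "pending"
    findings: list[str] = field(default_factory=list)

    def redirected(self) -> bool:
        return self.requested_url != self.resolved_url

    def ready_for_synthesis(self) -> bool:
        """Synthesis may start only after verification passes with no
        blocking findings and every chunk ID is accounted for."""
        return (self.verification_status == "verified"
                and not self.findings
                and len(self.chunk_ids) == self.chunk_count)
```

Keeping chunk_ids alongside chunk_count lets the synthesizing agent detect a partial handoff before writing anything.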
Redirect Handling
If a requested URL redirects, record both URLs. If Firecrawl returns zero pages for the redirecting URL, verify the resolved canonical URL in a browser or other read-only check, then repeat the dry run against the resolved URL before running the live crawl.
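The redirect protocol above can be sketched as a small resolver. Both callables here are hypothetical stand-ins: `dry_run` for the bounded dry-run step and `read_only_resolve` for the browser or other read-only canonical-URL check.

```python
def resolve_crawl_target(requested_url: str, dry_run, read_only_resolve) -> str:
    """Redirect protocol sketch: if the dry run against the requested
    URL yields zero pages, resolve the canonical URL with a read-only
    check and repeat the dry run there before any live crawl.
    `dry_run` and `read_only_resolve` are hypothetical callables."""
    if dry_run(requested_url)["estimated_pages"] > 0:
        return requested_url
    resolved = read_only_resolve(requested_url)   # e.g. a browser check
    if resolved != requested_url and dry_run(resolved)["estimated_pages"] > 0:
        return resolved
    raise RuntimeError(f"No pages found for {requested_url} or its redirect")
```

Whichever URL wins, both the requested and resolved URLs still go into the evidence record.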
The MCP Security Best Practices ingestion used this path: the dated specification URL redirected to the current docs/tutorials URL, and the resolved URL was the one successfully indexed.
Synthesis Handoff
After verification, Codex records a seam for Claude/Gemini with:
- source page ID
- chunk count
- crawl job ID
- source URL and resolved URL
- recommended target note path
- required links
- instruction not to crawl again
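A seam record covering the fields above might look like the following. Every name here is illustrative, not a fixed schema; the point is that the downstream agent can validate the seam before touching the vault.

```python
# Hypothetical seam record Codex leaves for the synthesizing agent;
# field names are illustrative, not a fixed schema.
seam = {
    "source_page_id": 42,
    "chunk_count": 7,
    "crawl_job_id": "fc-job-123",
    "source_url": "https://example.com/spec/2024-01-01",
    "resolved_url": "https://example.com/docs/tutorial",
    "target_note_path": "03_Wiki/example-topic.md",
    "required_links": ["[[example-moc]]"],
    "do_not_recrawl": True,
}

REQUIRED_SEAM_FIELDS = {
    "source_page_id", "chunk_count", "crawl_job_id", "source_url",
    "resolved_url", "target_note_path", "required_links", "do_not_recrawl",
}

def seam_is_complete(record: dict) -> bool:
    """The synthesizing agent should refuse a seam that is missing any
    field, or one that would permit a fresh crawl."""
    return (REQUIRED_SEAM_FIELDS <= record.keys()
            and record["do_not_recrawl"] is True)
```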
Claude should retrieve evidence through semantic_search_sources, write the target note, preserve provenance, validate with audit-yanp.ps1 and check-broken-links.ps1, and commit only the note plus any explicit handoff artifact.
Gemini should then integrate the note into MOCs, backlinks, specs, and logs without changing the evidence provenance unless it finds a concrete mismatch.
Commit Boundaries
Prefer three small commits:
- feat(wiki): add <source> literature note
- docs(handoff): graph integration brief for Gemini re <source>
- chore(graph): wire <source> into <cluster> cross-references
Do not mix mechanical ingestion tooling changes with source synthesis commits.
Provenance Shape
Current synthesized notes may use source_page_id plus chunk_ids, while some older notes use nested provenance: frontmatter. The YANP auditor accepts both today. A future normalization pass should decide one canonical shape and migrate notes consistently.
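The two shapes in circulation can be put side by side, with a sketch of the normalization a future pass might apply. The shapes are inferred from the description above, and picking the flat form as canonical is an assumption of this sketch, not a decision the runbook has made:

```python
# Flat shape used by current synthesized notes (illustrative).
flat = {"source_page_id": 42, "chunk_ids": [101, 102, 103]}

# Nested shape found in some older notes (illustrative).
nested = {"provenance": {"source_page_id": 42, "chunk_ids": [101, 102, 103]}}

def normalize_provenance(frontmatter: dict) -> dict:
    """Accept either shape and return the flat form.
    Treating flat as canonical is an assumption; the runbook leaves
    that choice to a future normalization pass."""
    inner = frontmatter.get("provenance", frontmatter)
    return {"source_page_id": inner["source_page_id"],
            "chunk_ids": list(inner["chunk_ids"])}
```

A migration would run this over every note's frontmatter and rewrite only the notes whose shape changes.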