Spec: Firecrawl + Postgres/pgvector Source Pipeline
Purpose: Define the architecture for crawling external documentation sites, storing canonical source records in Postgres, chunking and embedding them with provenance, and retrieving semantically relevant chunks for a research assistant. This pipeline is source material only — it produces raw, attributed chunks for downstream synthesis agents. It is the ingestion layer of the vault's knowledge compiler.
External Docs/Sites
│
▼
[Firecrawl API] ← crawl, clean, extract markdown
│
▼
[Canonical Store] ← Postgres: one row per source page
│
▼
[Chunk + Embed] ← split by heading → fixed-size, generate vectors
│
▼
[pgvector Index] ← Supabase: HNSW cosine index on chunks
│
▼
[match_documents()] ← semantic search RPC → research assistant
1. Firecrawl — Crawl & Extraction Layer
Firecrawl is a hosted crawling service that returns LLM-optimized Markdown from any URL, handling JS rendering, bot detection, and content extraction. It is not a chunker — all splitting happens downstream.
1.1 Key Endpoints
| Endpoint | Use | Returns |
|---|---|---|
| POST /v2/scrape | Single page | Markdown + metadata, synchronous |
| POST /v2/crawl | Multi-page site | Job ID; poll for results (async) |
| POST /v2/map | Discover all URLs | URL list only, no content |
| POST /v2/extract | Structured LLM extraction | JSON matching a provided schema |
| POST /v2/search | Web search + extract | Top-N results as markdown |
1.2 Scrape Request
POST /v2/scrape
{
"url": "https://docs.firecrawl.dev/features/crawl",
"formats": ["markdown", "links"],
"onlyMainContent": true,
"excludeTags": ["nav", "footer", "header", ".sidebar"],
"waitFor": 1000
}
| Parameter | Type | Notes |
|---|---|---|
| formats | string[] | "markdown", "html", "links", "screenshot", "extract" |
| onlyMainContent | bool | Strips nav/footer/sidebar — always true for doc ingestion |
| excludeTags | string[] | CSS selectors to remove before extraction |
| waitFor | int (ms) | Wait for JS to render before scraping |
1.3 Scrape Response Object
{
"success": true,
"data": {
"markdown": "# Page Title\n\nContent...",
"links": ["https://docs.example.com/page-2", "..."],
"metadata": {
"title": "Page Title",
"description": "Meta description",
"language": "en",
"keywords": ["keyword1"],
"statusCode": 200,
"contentType": "text/html",
"sourceURL": "https://docs.example.com/page-1"
}
}
}
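Putting 1.2 and 1.3 together, a single-page scrape is one synchronous HTTP call. A minimal sketch with httpx (the same client Section 7 uses); scrape_page is an illustrative helper, not part of any Firecrawl SDK:

```python
# Single-page scrape via POST /v2/scrape; assumes FIRECRAWL_API_KEY is set.
import os
import httpx

def scrape_page(url: str) -> dict:
    """Return the `data` object from 1.3: markdown, links, metadata."""
    resp = httpx.post(
        "https://api.firecrawl.dev/v2/scrape",
        headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
        json={
            "url": url,
            "formats": ["markdown", "links"],
            "onlyMainContent": True,
            "excludeTags": ["nav", "footer", "header", ".sidebar"],
            "waitFor": 1000,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]
```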
1.4 Crawl Request (Multi-Page)
POST /v2/crawl
{
"url": "https://docs.example.com",
"limit": 500,
"maxDiscoveryDepth": 3,
"includePaths": ["^/docs/", "^/api/"],
"excludePaths": ["^/blog/", "^/changelog/", "\\.pdf$"],
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
Returns { "id": "crawl_job_abc123" } immediately. Poll GET /v2/crawl/{id} until status: "completed". Results are an array of the same scrape response objects.
Rate limits: Free 500 pages/month · Hobby 3k · Standard 100k (50 concurrent). Use includePaths aggressively to avoid wasting credits on irrelevant pages.
2. Postgres Schema — Canonical Source Store
Two tables: source_pages holds the raw canonical record (one per URL); source_chunks holds the processed chunks with embeddings and provenance. This separation means re-chunking or re-embedding never touches the canonical record.
-- Enable the pgvector extension (Supabase: already available in extensions schema)
CREATE EXTENSION IF NOT EXISTS vector WITH SCHEMA extensions;
CREATE EXTENSION IF NOT EXISTS pgcrypto; -- for gen_random_uuid()
-- Canonical record: one row per crawled URL
CREATE TABLE source_pages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL UNIQUE,
title TEXT,
description TEXT,
language TEXT DEFAULT 'en',
markdown TEXT NOT NULL, -- raw Firecrawl output, untouched
status_code INT,
crawled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
etag TEXT, -- HTTP ETag for dedup on re-crawl
last_modified TEXT, -- HTTP Last-Modified header
content_hash TEXT NOT NULL, -- SHA-256 of markdown for change detection
domain TEXT GENERATED ALWAYS AS (
split_part(regexp_replace(url, 'https?://', ''), '/', 1)
) STORED
);
CREATE INDEX idx_source_pages_url ON source_pages(url);
CREATE INDEX idx_source_pages_domain ON source_pages(domain);
CREATE INDEX idx_source_pages_hash ON source_pages(content_hash);
-- Chunks: one row per embedded chunk, all provenance carried inline
CREATE TABLE source_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
page_id UUID NOT NULL REFERENCES source_pages(id) ON DELETE CASCADE,
-- Content
content TEXT NOT NULL,
content_hash TEXT NOT NULL, -- SHA-256 of chunk text; skip re-embed if unchanged
-- Provenance (fully denormalized for fast retrieval — no join needed at query time)
source_url TEXT NOT NULL,
domain TEXT NOT NULL,
page_title TEXT,
section_heading TEXT, -- nearest H1/H2/H3 above this chunk
chunk_index INT NOT NULL, -- 0-based position within the page
chunk_total INT NOT NULL, -- total chunks for this page
crawled_at TIMESTAMPTZ NOT NULL,
-- Vector
embedding extensions.vector(1536), -- set D to match your embedding model
-- Timestamps
embedded_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_chunks_page_id ON source_chunks(page_id);
CREATE INDEX idx_chunks_content_hash ON source_chunks(content_hash);
CREATE INDEX idx_chunks_domain ON source_chunks(domain);
-- HNSW index for cosine similarity (build after initial bulk insert)
CREATE INDEX idx_chunks_embedding ON source_chunks
USING hnsw (embedding extensions.vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Why HNSW over IVFFlat?
- HNSW builds on an empty table — no training step, so it's safe for incremental inserts.
- Better query speed-recall tradeoff at the cost of higher memory.
- m = 16 and ef_construction = 64 are good defaults; tune up to m = 32, ef_construction = 128 for higher recall at the cost of longer index build time.
Embedding dimensions: Set vector(D) to match your model:
| Model | D |
|---|---|
| OpenAI text-embedding-3-small | 1536 |
| OpenAI text-embedding-ada-002 | 1536 |
| OpenAI text-embedding-3-large | 3072 |
| Cohere embed-english-v3.0 | 1024 |
| gte-small (open source) | 384 |
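The embed() calls in later sections are left to the reader. A minimal sketch, assuming OpenAI's Python client and text-embedding-3-small to match the vector(1536) column above; a different model requires changing both the model name and D together:

```python
# Sketch of an embed() helper; assumes the openai package and OPENAI_API_KEY.
from openai import OpenAI

_client = OpenAI()

def embed(text: str) -> list[float]:
    """Return a 1536-dimension embedding for a chunk or query string."""
    resp = _client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding
```

Section 7 awaits embed(); an async variant would use AsyncOpenAI with the same call shape.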
3. Chunking Strategy
Firecrawl returns clean Markdown — the pipeline splits it before embedding. Recommended: heading-aware hybrid chunking.
3.1 Algorithm
import re, hashlib
MAX_CHUNK_TOKENS = 512 # ~400 words
OVERLAP_TOKENS = 50 # ~40 words; carry context across chunk boundary
def chunk_markdown(markdown: str, page_meta: dict) -> list[dict]:
"""Split markdown by heading, then fixed-size within each section."""
sections = split_by_heading(markdown) # list of (heading, text) tuples
chunks = []
idx = 0
for heading, section_text in sections:
for window in sliding_window(section_text, MAX_CHUNK_TOKENS, OVERLAP_TOKENS):
chunks.append({
"content": window,
"content_hash": sha256(window),
"section_heading": heading,
"chunk_index": idx,
"source_url": page_meta["url"],
"domain": extract_domain(page_meta["url"]),
"page_title": page_meta["title"],
"crawled_at": page_meta["crawled_at"],
})
idx += 1
# Backfill chunk_total now that we know it
for c in chunks:
c["chunk_total"] = idx
return chunks
def split_by_heading(markdown: str) -> list[tuple[str, str]]:
"""Split on H1/H2/H3 headings. Returns list of (heading_text, section_body)."""
pattern = re.compile(r'^(#{1,3} .+)$', re.MULTILINE)
parts = pattern.split(markdown)
# parts alternates: [preamble, heading, body, heading, body, ...]
result = [("(preamble)", parts[0])]
for i in range(1, len(parts), 2):
heading = parts[i].lstrip('#').strip()
body = parts[i+1] if i+1 < len(parts) else ""
if body.strip():
result.append((heading, body))
return result
def sha256(text: str) -> str:
return hashlib.sha256(text.encode()).hexdigest()
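chunk_markdown() relies on two helpers the listing omits. A minimal sketch, approximating tokens with whitespace-delimited words (swap in a real tokenizer such as tiktoken for exact counts):

```python
from urllib.parse import urlparse

def sliding_window(text: str, max_tokens: int, overlap: int) -> list[str]:
    """Overlapping fixed-size windows over a section body (word count as a token proxy)."""
    words = text.split()
    if not words:
        return []
    step = max(max_tokens - overlap, 1)
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)]

def extract_domain(url: str) -> str:
    """Host part of a URL, e.g. 'docs.firecrawl.dev'."""
    return urlparse(url).netloc
```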
3.2 Chunking Rules of Thumb
| Decision | Guidance |
|---|---|
| Chunk size | 256–512 tokens for precise retrieval; 512–1024 for richer context |
| Overlap | 10–20% of chunk size; ensures no fact is cut across a boundary |
| Split boundaries | Always prefer heading → paragraph → sentence over mid-sentence cuts |
| Minimum chunk | Discard chunks < 50 words — too thin to embed meaningfully |
| Section heading | Carry the nearest ancestor heading (H1 > H2 > H3), not just the immediate one |
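The minimum-chunk rule can be applied as a post-filter on chunk_markdown() output; filter_thin_chunks is an illustrative helper using the 50-word guideline above:

```python
def filter_thin_chunks(chunks: list[dict], min_words: int = 50) -> list[dict]:
    """Drop chunks too thin to embed; chunk_index keeps its original value, so page position is preserved."""
    return [c for c in chunks if len(c["content"].split()) >= min_words]
```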
4. Deduplication
Two levels prevent wasted API calls and embedding credits.
4.1 URL-Level (ETag / Last-Modified)
Before re-crawling a URL, issue an HTTP HEAD request:
resp = httpx.head(url, headers={"If-None-Match": stored_etag})
if resp.status_code == 304:
return # content unchanged — skip Firecrawl entirely
Store etag and last_modified in source_pages. Update on every full crawl.
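A sketch of that check against the stored validators, assuming the source_pages columns from Section 2; needs_recrawl is an illustrative helper that honors both ETag and Last-Modified:

```python
import httpx

def needs_recrawl(supabase, url: str) -> bool:
    """Return False when the origin server answers 304 Not Modified for this URL."""
    row = supabase.table("source_pages").select("etag, last_modified").eq("url", url).execute()
    if not row.data:
        return True  # never crawled before
    headers = {}
    if row.data[0].get("etag"):
        headers["If-None-Match"] = row.data[0]["etag"]
    if row.data[0].get("last_modified"):
        headers["If-Modified-Since"] = row.data[0]["last_modified"]
    if not headers:
        return True  # no validators stored yet; crawl and capture them
    return httpx.head(url, headers=headers, follow_redirects=True).status_code != 304
```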
4.2 Content-Level (SHA-256 Hash)
After chunking, check if each chunk's content_hash already exists:
SELECT id FROM source_chunks WHERE content_hash = $1;
If the hash exists, skip the embedding API call and reuse the existing row. On re-crawls of stable documentation sites this typically cuts embedding costs by roughly 90%.
Upsert pattern for source_pages:
INSERT INTO source_pages (url, title, markdown, content_hash, crawled_at, etag, ...)
VALUES ($1, $2, $3, $4, NOW(), $5, ...)
ON CONFLICT (url) DO UPDATE
SET markdown = EXCLUDED.markdown,
content_hash = EXCLUDED.content_hash,
crawled_at = EXCLUDED.crawled_at,
etag = EXCLUDED.etag
WHERE source_pages.content_hash != EXCLUDED.content_hash;
-- WHERE clause: only update if content actually changed
When a page's content_hash changes on re-crawl: delete its old chunks explicitly (the ON DELETE CASCADE only fires when the page row itself is deleted, not when it is updated), re-chunk, re-embed, and insert fresh chunks.
5. Semantic Search — match_documents RPC
Supabase's PostgREST layer cannot call pgvector operators directly from the client. Wrap the similarity query in a Postgres function:
CREATE OR REPLACE FUNCTION match_documents(
query_embedding extensions.vector(1536),
match_threshold FLOAT DEFAULT 0.75,
match_count INT DEFAULT 10,
filter_domain TEXT DEFAULT NULL -- optional: restrict to one site
)
RETURNS TABLE (
id UUID,
content TEXT,
source_url TEXT,
domain TEXT,
page_title TEXT,
section_heading TEXT,
chunk_index INT,
crawled_at TIMESTAMPTZ,
similarity FLOAT
)
LANGUAGE SQL STABLE
AS $$
SELECT
c.id,
c.content,
c.source_url,
c.domain,
c.page_title,
c.section_heading,
c.chunk_index,
c.crawled_at,
1 - (c.embedding <=> query_embedding) AS similarity
FROM source_chunks c
WHERE
(filter_domain IS NULL OR c.domain = filter_domain)
AND 1 - (c.embedding <=> query_embedding) > match_threshold
ORDER BY c.embedding <=> query_embedding -- ascending distance = descending similarity
LIMIT match_count;
$$;
Call from TypeScript (supabase-js):
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: await embed(userQuery), // float[] from your embedding model
match_threshold: 0.75,
match_count: 8,
filter_domain: 'docs.firecrawl.dev' // null for all domains
});
// data: Array<{ id, content, source_url, section_heading, similarity }>
Call from Python:
results = supabase.rpc('match_documents', {
'query_embedding': embed(user_query),
'match_threshold': 0.75,
'match_count': 8,
}).execute()
Similarity threshold guidance:
- > 0.85 — high precision, low recall (near-exact match)
- 0.75–0.85 — balanced (recommended default)
- 0.60–0.75 — broad retrieval; useful for exploratory research
- < 0.60 — nearly all chunks; not useful for RAG
6. Research Assistant Assembly
The pipeline produces chunks; a research assistant wraps them with a synthesis step:
User query
│
▼
embed(query) → query_embedding
│
▼
match_documents(query_embedding, threshold=0.75, count=8)
│
▼
Deduplicate by page (take top-N chunks per source_url to avoid over-representing one page)
│
▼
Build context window:
For each chunk:
"[Source: {page_title} — {section_heading}]\n{content}\n"
"[URL: {source_url}, crawled {crawled_at}]\n\n"
│
▼
LLM prompt:
SYSTEM: "You are a research assistant. Answer using only the provided sources.
Cite sources by URL for every claim."
USER: "{user_query}\n\n---SOURCES---\n{context_window}"
│
▼
Response with inline citations → vault note (type: literature or fleeting)
Provenance in the response: Every claim the assistant makes can be traced back to a source_url + section_heading + chunk_index + crawled_at. This four-field tuple is the attribution record — sufficient to re-fetch and verify any claim.
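A minimal sketch of the retrieval-and-assembly half of this flow, reusing the match_documents() RPC from Section 5 and the embed() helper; build_context and MAX_CHUNKS_PER_PAGE are illustrative names, and the per-page cap implements the deduplication step above:

```python
from collections import defaultdict

MAX_CHUNKS_PER_PAGE = 2  # avoid over-representing a single source page

def build_context(supabase, user_query: str) -> str:
    """Retrieve matching chunks and format them with the provenance template above."""
    rows = supabase.rpc("match_documents", {
        "query_embedding": embed(user_query),
        "match_threshold": 0.75,
        "match_count": 8,
    }).execute().data

    per_page, blocks = defaultdict(int), []
    for r in rows:  # rows arrive ordered by descending similarity
        if per_page[r["source_url"]] >= MAX_CHUNKS_PER_PAGE:
            continue
        per_page[r["source_url"]] += 1
        blocks.append(
            f"[Source: {r['page_title']} — {r['section_heading']}]\n"
            f"{r['content']}\n"
            f"[URL: {r['source_url']}, crawled {r['crawled_at']}]\n"
        )
    return "\n".join(blocks)
```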
7. Ingestion Script Skeleton (Python)
import asyncio, hashlib, os
from datetime import datetime

import httpx
from supabase import create_client

# sha256() and chunk_markdown() are defined in Section 3; embed() is your embedding client call.
FIRECRAWL_KEY = os.environ["FIRECRAWL_API_KEY"]
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_SERVICE_KEY"]
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
async def ingest_site(root_url: str, include_paths: list[str] | None = None):
# 1. Crawl
job = httpx.post("https://api.firecrawl.dev/v2/crawl",
headers={"Authorization": f"Bearer {FIRECRAWL_KEY}"},
json={
"url": root_url,
"limit": 200,
"maxDiscoveryDepth": 3,
"includePaths": include_paths or [],
"scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True}
}).json()
# 2. Poll until done
while True:
status = httpx.get(f"https://api.firecrawl.dev/v2/crawl/{job['id']}",
headers={"Authorization": f"Bearer {FIRECRAWL_KEY}"}).json()
if status["status"] == "completed":
pages = status["data"]
break
await asyncio.sleep(5)
# 3. For each page: upsert canonical record + chunk + embed
for page in pages:
md = page["data"]["markdown"]
meta = page["data"]["metadata"]
url = meta["sourceURL"]
h = sha256(md)
        # Skip this page if the content hash is unchanged from the previous run
        prev = supabase.table("source_pages")\
            .select("id, content_hash").eq("url", url).execute()
        if prev.data and prev.data[0]["content_hash"] == h:
            continue  # content unchanged; keep existing chunks and embeddings
        # Upsert canonical record, then look up its id
        supabase.table("source_pages").upsert({
            "url": url, "title": meta.get("title"), "markdown": md,
            "content_hash": h, "status_code": meta.get("statusCode"),
        }, on_conflict="url").execute()
        page_id = supabase.table("source_pages")\
            .select("id").eq("url", url).execute().data[0]["id"]
# Delete stale chunks
supabase.table("source_chunks").delete().eq("page_id", page_id).execute()
# Chunk, embed, insert
chunks = chunk_markdown(md, {"url": url, "title": meta.get("title"),
"crawled_at": datetime.utcnow().isoformat()})
for chunk in chunks:
existing_embed = supabase.table("source_chunks")\
.select("id").eq("content_hash", chunk["content_hash"]).execute()
if existing_embed.data:
continue # reuse existing embedding
embedding = await embed(chunk["content"]) # your embedding model call
supabase.table("source_chunks").insert({
**chunk, "page_id": page_id, "embedding": embedding
}).execute()
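Example invocation, using the same path filters as the crawl request in 1.4:

```python
if __name__ == "__main__":
    asyncio.run(ingest_site("https://docs.example.com", include_paths=["^/docs/", "^/api/"]))
```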
8. Connections to Vault
This pipeline is the ingestion layer for all external sources referenced in the vault. Its output (chunks + provenance) feeds:
- firecrawl-crawling-capabilities — Implementation reference for /crawl.
- firecrawl-map-capabilities — Implementation reference for /map.
- firecrawl-scrape-capabilities — Implementation reference for /scrape.
- firecrawl-api-v2-reference — Technical foundations and authentication.
- semantic-embedding-pipeline — which handles *internal* vault note embeddings; this spec handles *external* source embeddings.
- community-report-generator — community reports can be grounded in external source chunks, not just internal notes.
- spec-memory-mcp — the Memory MCP Server can be extended to serve source_chunks rows in addition to agent memories (same match_documents pattern, separate resource: memory://sources).
- lit-mcp-architecture — the match_documents RPC is a natural fit for an MCP Tool (search_sources), exposing this pipeline to any MCP-compatible agent.
match_documentsRPC is a natural fit for an MCP Tool (search_sources), exposing this pipeline to any MCP-compatible agent.