NOTE

provenance_chunk_ids:
aliases:
provenance_source_ids:
provenance_retrieved: 2026-05-01T03:38:58Z
date: 2026-04-30
title: Firecrawl Crawling Capabilities
type: permanent
provenance_agent: claude-chronicler
author: gemini-cli
status: active

Firecrawl Crawling Capabilities

Firecrawl's crawl path is the vault's bounded multi-page ingestion mechanism: start from a root URL, traverse within configured limits, and return page content asynchronously as scrape-like results.

What Crawl Is For

  • Use crawl when one page is not enough and you need a constrained slice of a documentation site.
  • Prefer crawl over scrape when navigation depth and path filters matter.
  • Prefer map first when you need URL discovery without fetching page bodies.
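The routing decision above can be sketched as a tiny helper; the function and parameter names are illustrative, not part of any Firecrawl API:

```python
def choose_endpoint(multi_page: bool, need_bodies: bool) -> str:
    """Pick an ingestion endpoint per the note's guidance:
    scrape for one page, map for discovery-only, crawl for bounded multi-page fetch."""
    if not multi_page:
        return "scrape"
    return "crawl" if need_bodies else "map"
```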

Working Request Shape

The pipeline spec models crawl requests like:

{
  "url": "https://docs.example.com",
  "limit": 500,
  "maxDiscoveryDepth": 3,
  "includePaths": ["^/docs/", "^/api/"],
  "excludePaths": ["^/blog/", "^/changelog/", "\\.pdf$"],
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

The local ingest server uses the same structure with a smaller default limit and explicit polling against GET /v2/crawl/{id}.
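The submit-and-poll flow can be sketched as follows. The transport callables (`post_json`, `get_json`) are injected placeholders standing in for a real HTTP client, and the `data` field name on the completed response is an assumption about the payload shape; only the endpoint paths and the job-ID/poll pattern come from the spec above:

```python
import time

def run_crawl(post_json, get_json, request_body, poll_interval=2.0, max_polls=150):
    """Submit a crawl job, then poll GET /v2/crawl/{id} until it completes.

    post_json / get_json are injected transport callables returning parsed
    JSON dicts, so the flow can be exercised without a live endpoint.
    """
    job = post_json("/v2/crawl", request_body)
    job_id = job["id"]  # POST /v2/crawl returns an ID immediately
    for _ in range(max_polls):
        status = get_json(f"/v2/crawl/{job_id}")
        if status.get("status") == "completed":
            # assumed shape: completed pages arrive as an array of scrape-like objects
            return status.get("data", [])
        time.sleep(poll_interval)
    raise TimeoutError(f"crawl {job_id} did not complete within the polling budget")
```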

Key Capabilities

  • Bounded traversal: limit and maxDiscoveryDepth keep the crawl from expanding indefinitely.
  • Path scoping: includePaths and excludePaths are the main cost-control and relevance-control levers.
  • Scrape inheritance: scrapeOptions lets crawl results reuse the same markdown-oriented extraction settings as single-page scrape operations.
  • Async execution: the initial request returns a job ID; results arrive only after polling completes.
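The path-scoping semantics can be approximated locally with plain regex matching. This is a sketch of the filtering behavior implied by the request shape above (a URL passes if any includePaths pattern matches its path and no excludePaths pattern does), not Firecrawl's actual frontier implementation:

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include_paths, exclude_paths):
    """Approximate includePaths/excludePaths filtering for a crawl frontier."""
    path = urlparse(url).path
    if include_paths and not any(re.search(p, path) for p in include_paths):
        return False  # an include list is a whitelist: no match, no fetch
    return not any(re.search(p, path) for p in exclude_paths)
```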

Output Model

  • POST /v2/crawl returns an ID immediately.
  • GET /v2/crawl/{id} is polled until status == "completed".
  • Completed pages are treated as an array of scrape response objects, each with extracted content and metadata.
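A minimal sketch of normalizing that array for downstream use; the `markdown`, `metadata`, and `sourceURL` field names are assumptions about the scrape-object shape, not confirmed by the local spec:

```python
def collect_pages(crawl_data):
    """Flatten completed crawl output into (url, markdown) pairs.

    Assumes each entry is a scrape-style object with a `markdown` string and a
    `metadata` dict carrying the source URL; skips pages with no extracted body.
    """
    out = []
    for page in crawl_data:
        meta = page.get("metadata", {})
        url = meta.get("sourceURL") or meta.get("url", "")
        md = page.get("markdown", "")
        if md:
            out.append((url, md))
    return out
```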

Operational Guidance

  • Use includePaths aggressively for docs sites; broad crawls waste credits and increase irrelevant retrieval.
  • Treat crawl as an extraction primitive, not a synthesis primitive. Content still needs canonical storage, chunking, and embedding downstream.
  • The current vault convention is markdown-first extraction with onlyMainContent: true.
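To make the extraction-versus-synthesis point concrete, here is one minimal way the downstream chunking step might look; the heading-boundary strategy and size cap are illustrative choices, not the vault's actual pipeline:

```python
def chunk_markdown(md, max_chars=2000):
    """Split a markdown page at heading boundaries, then by size cap."""
    sections, current = [], []
    for line in md.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))  # close section at each heading
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks = []
    for sec in sections:
        for i in range(0, len(sec), max_chars):  # enforce the size cap per section
            chunks.append(sec[i:i + max_chars])
    return chunks
```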

Distinction from Neighboring Endpoints

  • scrape fetches a single page; map discovers URLs without fetching page bodies; crawl combines discovery and scraping within the configured limits.

Limits and Caveats

  • This note is grounded in the local spec and ingest implementation, not a full live Firecrawl reference.
  • Vendor-side limits, concurrency, and plan quotas may change; the pipeline spec currently notes page-tier limits and advises tight path filtering.
  • "LLM-ready Markdown" should be read as cleaner-than-raw HTML, not as a guarantee that every page arrives perfectly normalized.

Related