Firecrawl Crawling Capabilities
Firecrawl's crawl path is the vault's bounded multi-page ingestion mechanism: start from a root URL, traverse within configured limits, and return page content asynchronously as scrape-like results.
What Crawl Is For
- Use crawl when one page is not enough and you need a constrained slice of a documentation site.
- Prefer crawl over scrape when navigation depth and path filters matter.
- Prefer map first when you need URL discovery without fetching page bodies.
Working Request Shape
The pipeline spec models crawl requests like:
{
  "url": "https://docs.example.com",
  "limit": 500,
  "maxDiscoveryDepth": 3,
  "includePaths": ["^/docs/", "^/api/"],
  "excludePaths": ["^/blog/", "^/changelog/", "\\.pdf$"],
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}
The local ingest server uses the same structure with a smaller default limit and explicit polling against GET /v2/crawl/{id}.
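As a rough illustration, the request body could be assembled in Python along these lines. The helper name build_crawl_request and its parameters are hypothetical; only the JSON field names come from the spec excerpt. The sketch builds the body and does not send it.

```python
import json


def build_crawl_request(url, limit, max_depth, include_paths=(), exclude_paths=()):
    """Assemble a crawl request body using the field names from the pipeline spec.

    Hypothetical helper: the function and its parameters are illustrative only.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    if max_depth < 0:
        raise ValueError("max_depth must be zero or greater")

    body = {
        "url": url,
        "limit": limit,
        "maxDiscoveryDepth": max_depth,
        # Vault convention: markdown-first extraction of the main content only.
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    }
    # Only emit the path filters that were actually supplied.
    if include_paths:
        body["includePaths"] = list(include_paths)
    if exclude_paths:
        body["excludePaths"] = list(exclude_paths)
    return body


if __name__ == "__main__":
    request = build_crawl_request(
        "https://docs.example.com",
        limit=500,
        max_depth=3,
        include_paths=[r"^/docs/", r"^/api/"],
        exclude_paths=[r"^/blog/", r"^/changelog/", r"\.pdf$"],
    )
    print(json.dumps(request, indent=2))
```

Validating limit and max_depth before the request is built keeps an accidental unbounded crawl from ever reaching the API.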
Key Capabilities
- Bounded traversal: limit and maxDiscoveryDepth keep the crawl from expanding indefinitely.
- Path scoping: includePaths and excludePaths are the main cost-control and relevance-control levers.
- Scrape inheritance: scrapeOptions lets crawl results reuse the same markdown-oriented extraction settings as single-page scrape operations.
- Async execution: the initial request returns a job ID; results arrive only after polling completes.
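To reason about path scoping before spending credits, it can help to dry-run the filters locally. The sketch below assumes the patterns are regular expressions searched against the URL path and that an exclude match wins over an include match. Both are assumptions made for illustration, not confirmed vendor behavior, and path_in_scope is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def path_in_scope(url, include_paths, exclude_paths):
    """Return True if the URL's path would pass the include/exclude filters.

    Assumption: patterns are regexes searched against the path component,
    and excludes take precedence over includes.
    """
    path = urlparse(url).path
    if any(re.search(pattern, path) for pattern in exclude_paths):
        return False
    if include_paths:
        return any(re.search(pattern, path) for pattern in include_paths)
    # With no include patterns, everything not excluded is in scope.
    return True


if __name__ == "__main__":
    include = [r"^/docs/", r"^/api/"]
    exclude = [r"^/blog/", r"^/changelog/", r"\.pdf$"]
    for candidate in (
        "https://docs.example.com/docs/intro",
        "https://docs.example.com/blog/launch",
        "https://docs.example.com/docs/guide.pdf",
    ):
        print(candidate, path_in_scope(candidate, include, exclude))
```

Running candidate URLs from a prior map call through a check like this gives a cheap estimate of how many pages a crawl would touch.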
Output Model
- POST /v2/crawl returns an ID immediately.
- GET /v2/crawl/{id} is polled until status == "completed".
- Completed pages are treated as an array of scrape response objects, each with extracted content and metadata.
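The polling side of this model might look like the following sketch. The status fetcher is injected so the loop can be exercised without network access. The "data" key for the page array and the "failed" status are assumptions about the response shape, and poll_crawl is a hypothetical name rather than the ingest server's actual function.

```python
import time


def poll_crawl(fetch_status, crawl_id, interval_s=2.0, timeout_s=300.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll a crawl job until it completes and return its page results.

    fetch_status(crawl_id) must return the decoded JSON body of
    GET /v2/crawl/{id} as a dict. Assumed fields: "status" and "data".
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_status(crawl_id)
        status = job.get("status")
        if status == "completed":
            # Assumed: completed pages are listed under "data".
            return job.get("data", [])
        if status == "failed":
            raise RuntimeError(f"crawl {crawl_id} reported failure")
        if clock() >= deadline:
            raise TimeoutError(f"crawl {crawl_id} not completed after {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Fake fetcher standing in for the HTTP call: one pending poll, then done.
    canned = iter([
        {"status": "scraping"},
        {"status": "completed", "data": [{"markdown": "# Intro"}]},
    ])
    pages = poll_crawl(lambda _id: next(canned), "job-1", interval_s=0.0)
    print(len(pages))
```

Injecting fetch_status, sleep, and clock keeps the loop testable and leaves the choice of HTTP client to the caller.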
Operational Guidance
- Use includePaths aggressively for docs sites; broad crawls waste credits and increase irrelevant retrieval.
- Treat crawl as an extraction primitive, not a synthesis primitive. Content still needs canonical storage, chunking, and embedding downstream.
- The current vault convention is markdown-first extraction with onlyMainContent: true.
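To make the downstream chunking step concrete, here is one possible approach: split each page's markdown at heading lines, then cap the size of each piece. This is a sketch of one strategy, not the vault's actual chunker; chunk_markdown and the max_chars default are hypothetical.

```python
import re


def chunk_markdown(markdown, max_chars=1200):
    """Split markdown at ATX heading lines, then cap each chunk at max_chars.

    Hypothetical helper: a real pipeline may chunk by tokens or add overlap.
    """
    # Zero-width split just before any line starting with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-width slices.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    page = "# Setup\nInstall the tool.\n## Usage\nRun the command."
    for chunk in chunk_markdown(page):
        print(repr(chunk))
```

Splitting at headings first keeps each chunk aligned with a section of the source page, which tends to help retrieval more than purely fixed-width slicing.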
Distinction from Neighboring Endpoints
- firecrawl-scrape-capabilities: single-page retrieval, synchronous.
- firecrawl-map-capabilities: discovery of URLs without page content.
- firecrawl-api-v2-reference: broader endpoint surface and shared request/response conventions.
Limits and Caveats
- This note is grounded in the local spec and ingest implementation, not a full live Firecrawl reference.
- Vendor-side limits, concurrency, and plan quotas may change; the pipeline spec currently notes page-tier limits and advises tight path filtering.
- "LLM-ready Markdown" should be read as cleaner-than-raw HTML, not as a guarantee that every page arrives perfectly normalized.