Firecrawl is an open-source web data API designed to power AI agents with clean, structured web data. It solves the problem of unreliable, messy web scraping by providing a unified interface to extract, search, and interact with web content — handling JavaScript rendering, proxy rotation, rate limits, and dynamic content automatically. Aimed at developers building AI applications, it supports Python, Node.js, cURL, and a CLI, and integrates with AI agents and MCP clients.
The platform is built with TypeScript and offers both a hosted service and self-hostable open-source components. It leverages modern browser automation and LLM-aware extraction to deliver high-fidelity markdown and JSON outputs, with scalable crawling and batch processing capabilities. Deployment options include cloud-hosted API, Docker, and direct GitHub source.
What You Get
- Search Endpoint - Query the web like a search engine and receive full-page markdown and JSON content from results, enabling AI agents to find and consume real-time information without prior URLs.
- Scrape Endpoint - Convert any URL into clean markdown, structured JSON, or screenshots with automatic JS rendering, ad blocking, and content extraction — optimized for LLM input and minimal token usage.
- Interact Endpoint - Scrape a page, then programmatically click, type, or scroll using AI prompts to extract dynamic content (e.g., search results, login flows) — enabling complex web interactions without manual scripting.
- Agent Endpoint - Describe what data you need (e.g., “Find pricing plans for Notion”) and Firecrawl’s AI agent autonomously navigates, searches, and extracts results with source URLs — no URLs required upfront.
- Crawl Endpoint - Scrape entire websites recursively with configurable limits and formats (markdown, JSON), with automatic job queuing, status tracking, and batch processing for large-scale data collection.
- Map Endpoint - Instantly discover all URLs on a website without crawling, enabling efficient site mapping for indexing, auditing, or pre-crawl planning.
- Batch Scrape - Submit thousands of URLs in a single request for asynchronous, parallel scraping with progress tracking and output aggregation.
- MCP Integration - Connect Firecrawl to any MCP-compatible AI agent (e.g., Claude Code, OpenCode) via a single JSON configuration to enable real-time web access without custom code.
- Agent Onboarding Skill - AI agents can fetch a pre-built skill to auto-provision API keys and onboard users to Firecrawl, enabling seamless integration into agent ecosystems.
- Media Parsing - Extract text from web-hosted PDFs, DOCX, and other document formats embedded in web pages, expanding data extraction beyond HTML.
Common Use Cases
- Building AI research agents - A researcher uses Firecrawl’s Agent endpoint to find and extract comparative pricing data from 10 SaaS websites without knowing their URLs beforehand.
- Powering AI coding assistants - A developer integrates Firecrawl’s MCP server into Claude Code to let their AI agent scrape documentation or API references in real time during code generation.
- Automating competitive intelligence - A marketing team runs weekly crawls of competitor websites to extract product updates, pricing changes, and feature lists in structured JSON for dashboards.
- Creating AI-powered web data pipelines - A data engineer uses Batch Scrape and Crawl to ingest and normalize content from 50,000 product pages for LLM training, using markdown outputs to reduce token costs.
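The pipeline use case above boils down to splitting a large URL set into batch-scrape requests and submitting each as one job. A minimal sketch of the client-side chunking, where the batch size of 1,000 and the payload field names are illustrative assumptions rather than documented limits:

```python
from typing import Iterator


def chunk_urls(urls: list[str], batch_size: int = 1000) -> Iterator[list[str]]:
    """Yield fixed-size slices of the URL list, one slice per batch request."""
    for i in range(0, len(urls), batch_size):
        yield urls[i : i + batch_size]


def build_batch_payload(batch: list[str]) -> dict:
    """Assemble one batch-scrape request body; markdown output keeps
    downstream token costs low for LLM training or inference."""
    return {"urls": batch, "formats": ["markdown"]}


# 50,000 product pages become 50 batch requests of 1,000 URLs each.
urls = [f"https://shop.example/product/{i}" for i in range(50_000)]
payloads = [build_batch_payload(b) for b in chunk_urls(urls)]
```

Each payload would then be POSTed asynchronously, with job IDs polled for progress as the Batch Scrape feature describes.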
Under The Hood
Architecture
- Monolithic API service tightly coupled with Playwright microservice via direct HTTP calls, lacking clear service boundaries or async communication patterns
- No dependency injection or inversion of control; services are hard-coded with static configurations, reducing testability and extensibility
- Configuration is fragmented across environment variables without centralized management or service registry
- Data structures serve dual roles as API responses and internal models, blurring domain boundaries and violating separation of concerns
Tech Stack
- Node.js backend with Express and TypeScript, leveraging Bull and Redis for task queuing and rate limiting
- Playwright-based scraping microservice running in Docker with proxy support and resource isolation
- PostgreSQL integrated with Supabase for authentication and storage, complemented by Redis for session and caching layers
- Rust components integrated via native bindings to handle performance-critical operations
- Docker Compose orchestrates a multi-service architecture with network isolation, logging, and resource constraints
Code Quality
- Extensive test coverage across unit, integration, and end-to-end layers with robust mocking of external dependencies
- Strong type safety enforced through TypeScript and Pydantic, with comprehensive input validation and explicit error handling
- Consistent naming, modular structure, and well-organized test suites that reflect real-world usage scenarios
- Linting and automation are well-established, though custom error classes lack uniformity across the codebase
What Makes It Unique
- LLM-driven URL generation from natural language prompts enables semantic, context-aware crawling without predefined lists
- Dynamic JSON schema generation from user prompts eliminates manual schema definition for structured data extraction
- CSS selector-based transformers enable precise, fine-grained data harvesting beyond standard DOM parsing
- Integrated cost tracking and credit-based billing tied directly to LLM usage, enabling sustainable API monetization
- Zero-data retention policy with team-level access controls ensures compliance in sensitive environments
- Unified pipeline combining SERP discovery, LLM-powered transformation, and schema-aware parsing in a single cohesive workflow