Production-grade open source web scraping engine that turns any URL into clean markdown for AI agents — with built-in anti-bot bypass, proxy rotation, and browser session management.
Reader is a TypeScript library and CLI that gives AI agents reliable, production-grade access to the web. Built on top of Ulixee Hero — a headless browser purpose-built for stealth — Reader handles TLS fingerprinting, Cloudflare challenge detection, navigator spoofing, and WebRTC masking so scrapers don’t get blocked before they even reach the content.
The library exposes three composable primitives: scrape() converts one or more URLs into clean markdown or HTML with automatic main-content extraction, noise removal (nav, cookie banners, footers), and batch concurrency control; crawl() performs BFS link discovery from a starting URL with configurable depth and page limits, optionally scraping every discovered page; and browser() launches a stealthed Chrome instance accessible over the Chrome DevTools Protocol (CDP), letting Playwright or Puppeteer connect to a pre-configured stealth context with zero code changes.
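The breadth-first discovery described for crawl() can be sketched independently of Reader. The snippet below is a minimal illustration, not Reader's implementation: bfsDiscover, LinkFetcher, and the maxDepth/maxPages option names are hypothetical, and the link fetcher is injected so the traversal can run without a browser.

```typescript
// Hypothetical helper type: returns the outgoing links of a page.
// A real crawler would load the page and extract its <a href> targets.
type LinkFetcher = (url: string) => Promise<string[]>;

interface BfsOptions {
  maxDepth: number; // how many link hops from the start URL to follow
  maxPages: number; // hard cap on the number of pages visited
}

// Breadth-first link discovery: visit pages level by level, skipping
// URLs already seen, until the depth or page limit is reached.
async function bfsDiscover(
  startUrl: string,
  fetchLinks: LinkFetcher,
  options: BfsOptions,
): Promise<string[]> {
  const seen = new Set<string>([startUrl]);
  const visited: string[] = [];
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth <= options.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.length >= options.maxPages) return visited;
      visited.push(url);
      if (depth === options.maxDepth) continue; // do not expand the last level
      for (const link of await fetchLinks(url)) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return visited;
}
```

Processing one level at a time keeps the depth limit exact, and the seen set prevents cycles between pages from causing repeat visits.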
Under the hood, Reader manages a tiered browser pool keyed by proxy URL, with automatic proxy health tracking (circuit breaker, cooldown, revival), per-domain configuration profiles, robots.txt compliance, rate limiting, and a long-running daemon mode that keeps the pool warm between CLI commands. The markdown conversion uses a custom Rust-based library (supermarkdown) via NAPI for high-performance, panic-safe HTML-to-markdown transformation.
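To make the circuit breaker, cooldown, and revival cycle concrete, here is a small self-contained sketch of per-proxy health tracking. It is an assumption-laden illustration rather than Reader's code: the ProxyHealthTracker class, its method names, and the default threshold and cooldown values are all hypothetical.

```typescript
interface ProxyRecord {
  consecutiveFailures: number;
  trippedAt: number | null; // time the breaker opened, or null if closed
}

// Hypothetical tracker: after `failureThreshold` consecutive failures a
// proxy is taken out of rotation for `cooldownMs`, then revived.
class ProxyHealthTracker {
  private readonly records = new Map<string, ProxyRecord>();
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 3, // assumed default
    cooldownMs = 60000, // assumed default
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  private recordFor(proxyUrl: string): ProxyRecord {
    let rec = this.records.get(proxyUrl);
    if (!rec) {
      rec = { consecutiveFailures: 0, trippedAt: null };
      this.records.set(proxyUrl, rec);
    }
    return rec;
  }

  reportSuccess(proxyUrl: string): void {
    const rec = this.recordFor(proxyUrl);
    rec.consecutiveFailures = 0;
    rec.trippedAt = null;
  }

  reportFailure(proxyUrl: string): void {
    const rec = this.recordFor(proxyUrl);
    rec.consecutiveFailures += 1;
    if (rec.consecutiveFailures >= this.failureThreshold) {
      rec.trippedAt = this.now(); // open the breaker
    }
  }

  isAvailable(proxyUrl: string): boolean {
    const rec = this.recordFor(proxyUrl);
    if (rec.trippedAt === null) return true;
    if (this.now() - rec.trippedAt < this.cooldownMs) return false; // cooling down
    // Revival: the cooldown has elapsed, so give the proxy a clean slate.
    rec.consecutiveFailures = 0;
    rec.trippedAt = null;
    return true;
  }
}
```

Keying the records by proxy URL mirrors the pool layout described above, so a failing proxy can be sidelined without affecting browsers bound to other proxies.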
The project ships with an npm package (@vakra-dev/reader), a companion cloud offering (@vakra-dev/reader-js) for those who prefer managed infrastructure, and a comprehensive CLI with scrape, crawl, browser, start, stop, and status subcommands. It targets Node.js 18+ and is licensed under Apache 2.0.
scrape(), crawl(), and browser() methods that share a managed browser pool and initialize lazily on first use.
[…] webdriver=false), WebRTC masking, and automatic Cloudflare challenge detection and waiting baked into every request.
[…] connectOverCDP(), with no stealth configuration needed.
reader start to keep browser pools warm between CLI invocations; subsequent scrape and crawl commands auto-connect to the daemon, eliminating cold-start latency.
npx reader scrape, crawl, browser, start, stop, and status subcommands with options for concurrency, output format, proxy, timeout, CSS selectors, and verbose logging.
Architecture
Reader follows a layered, modular design organized around three public primitives (scrape, crawl, browser) that each delegate to a shared TieredBrowserPool managed by a ReaderClient lifecycle object. The Scraper class handles per-URL retry logic (datacenter proxy first, automatic escalation to residential on block detection, hard deadline enforcement), while the Crawler drives BFS link discovery and delegates individual page fetches back to the scraper layer. A dedicated EngineOrchestrator sits between the scraper and the Hero browser, applying domain profiles, running block-detection checks after each fetch, and choosing whether to retry with a higher-cost proxy tier. Browser sessions follow a separate path: each session gets its own dedicated Hero process rather than sharing the pool, giving callers complete isolation for long-lived interactive automation. The daemon process wraps the client in a persistent HTTP server so the pool survives across CLI invocations.
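The retry path described above (cheap tier first, escalation on a detected block, one overall deadline) might look roughly like the following. This is a sketch under stated assumptions, not the Scraper or EngineOrchestrator code: fetchWithEscalation, TierFetcher, looksBlocked, and withDeadline are hypothetical names, and the block heuristic (HTTP 403/429 or a "Just a moment" interstitial) is an assumed stand-in for Reader's real detection.

```typescript
type ProxyTier = "datacenter" | "residential";

interface FetchAttempt {
  status: number;
  body: string;
}

// Hypothetical injected fetcher: performs one request through the given tier.
type TierFetcher = (url: string, tier: ProxyTier) => Promise<FetchAttempt>;

// Cheapest tier first; escalate only when a block is detected.
const TIERS: readonly ProxyTier[] = ["datacenter", "residential"];

// Assumed heuristic for "this response is a block page".
function looksBlocked(attempt: FetchAttempt): boolean {
  return (
    attempt.status === 403 ||
    attempt.status === 429 ||
    /just a moment/i.test(attempt.body)
  );
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("deadline exceeded")), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function fetchWithEscalation(
  url: string,
  fetchVia: TierFetcher,
  deadlineMs: number,
): Promise<{ tier: ProxyTier; attempt: FetchAttempt }> {
  const deadline = Date.now() + deadlineMs;
  for (const tier of TIERS) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) throw new Error("deadline exceeded");
    // Each attempt only gets whatever is left of the overall budget.
    const attempt = await withDeadline(fetchVia(url, tier), remaining);
    if (!looksBlocked(attempt)) return { tier, attempt };
  }
  throw new Error(`blocked on all proxy tiers: ${url}`);
}
```

Sharing a single deadline across tiers means escalation never extends the total time a caller waits for one URL.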
Tech Stack
The project is written in TypeScript 5 running on Node.js 18+, compiled to ESM with tsup and Rollup. The core browser engine is Ulixee Hero 2.0 alpha, a Chromium-based headless browser specifically engineered for TLS fingerprint matching and anti-bot stealth, with Hero Core managing the shared browser runtime. HTML-to-markdown conversion uses the project’s own @vakra-dev/supermarkdown Rust NAPI module, providing panic-safe, high-performance conversion. Content parsing uses linkedom for DOM manipulation without a full browser context. Concurrency is controlled with p-limit; logging uses Pino with pino-pretty; the CLI is built on Commander.js. The test suite runs under Vitest; linting uses typescript-eslint and Prettier.
Code Quality
The codebase has extensive unit test coverage with over 20 test files spanning block detection, content cleaning, crawler logic, daemon dispatch, domain profiles, error types, health tracking, markdown formatting, metadata extraction, proxy configuration, robots parsing, scraper retry logic, tiered pool management, and URL utilities. Test files use Vitest’s globals mode with a 30-second timeout for browser-dependent tests. Error handling is explicit and typed: a structured error hierarchy exports named classes (NetworkError, CloudflareError, BotDetectedError, ProxyExhaustedError, etc.) that the scraper catches selectively to distinguish retryable failures from non-retryable ones like DNS errors or robots.txt blocks. The codebase uses TypeScript strict mode, consistent JSDoc comments on all public methods, and ESLint plus Prettier for consistent formatting.
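A typed hierarchy like the one described lets retry code branch on the error class instead of parsing messages. The sketch below uses local stand-in classes that borrow three of the names mentioned above; the ScrapeError base class, the DnsError and RobotsBlockedError names, the retryable flag, and the shouldRetry helper are all hypothetical, and which errors Reader actually treats as retryable is an assumption here.

```typescript
// Local stand-ins for illustration only; these are not imported from Reader.
class ScrapeError extends Error {
  readonly retryable: boolean;

  constructor(message: string, retryable: boolean) {
    super(message);
    // Keep instanceof working even when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.retryable = retryable;
  }
}

// Assumed retryable: a different proxy or a later attempt may succeed.
class NetworkError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}
class CloudflareError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}
class BotDetectedError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}

// Assumed non-retryable: repeating the request cannot change the outcome.
class DnsError extends ScrapeError {
  constructor(message: string) {
    super(message, false);
  }
}
class RobotsBlockedError extends ScrapeError {
  constructor(message: string) {
    super(message, false);
  }
}

// Retry only known, retryable errors, and only while attempts remain.
function shouldRetry(err: unknown, attempt: number, maxAttempts: number): boolean {
  if (attempt >= maxAttempts) return false;
  return err instanceof ScrapeError && err.retryable;
}
```

Unknown errors fall through to "do not retry", which keeps unexpected bugs from being masked by a retry loop.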
What Makes It Unique
Reader’s most distinctive technical choice is its foundation on Ulixee Hero rather than raw Playwright or Puppeteer. Hero’s MITM proxy layer injects real Chrome TLS fingerprints at the network level (not at the browser JS level), making it dramatically harder for anti-bot systems to detect automation via JA3/JA4 fingerprinting or TCP/IP stack analysis. The companion supermarkdown Rust library adds another layer of differentiation: rather than using a pure-JS HTML-to-markdown converter, it wraps a Rust implementation with NAPI bindings and catch_unwind panic safety, so pathological HTML inputs that would crash a JS converter are handled gracefully with a fallback text extraction path. The tiered proxy pool with automatic escalation is also uncommon in open source scrapers: most libraries make the caller manage proxy selection per request, while Reader models it as a pool-level concern with health-circuit-breaking baked in.
Reader is licensed under the Apache License 2.0, which is a permissive open source license. You can use it commercially, modify it, distribute it, and sublicense it without requiring your own application to be open-sourced. The only obligations are to preserve copyright and license notices and to indicate changes you made. There are no copyleft or viral provisions, so embedding Reader in a proprietary product or SaaS offering is straightforward.
Running Reader yourself requires a Node.js 18+ environment with Chrome or Chromium available (the package bundles a Chrome 139 binary for x86_64 Linux/macOS; Apple Silicon users must point to a system Chrome). For production deployments the browser pool consumes meaningful RAM — each Hero browser instance holds a Chromium process with its own memory space, so a pool of five browsers at 50 pages each before recycling will require several gigabytes of available memory on the host. There is no built-in horizontal scaling or distributed coordination: the TieredBrowserPool is single-process, so scaling beyond what one Node.js process can manage requires running multiple Reader instances behind your own work-distribution layer. The daemon mode keeps the pool warm between requests but is still single-host.
The companion cloud offering at app.reader.dev (@vakra-dev/reader-js) abstracts away all of the infrastructure: no browser process to manage, no proxy pool to configure, no memory pressure to monitor. The cloud tier adds managed proxy rotation, SLA-backed uptime, automatic scaling, and a hosted API key workflow. If your use case involves unpredictable burst traffic, very large crawls, or you prefer not to manage browser infrastructure, the managed tier is the pragmatic choice. The self-hosted path gives you full data sovereignty, no per-request pricing, and the ability to run air-gapped or within a private network.