reader

Name: reader
Rating: 5 (548 reviews)

Production-grade open source web scraping engine that turns any URL into clean markdown for AI agents — with built-in anti-bot bypass, proxy rotation, and browser session management.

548 stars

38 forks

Apache License 2.0

TypeScript

Reader is a TypeScript library and CLI that gives AI agents reliable, production-grade access to the web. Built on top of Ulixee Hero — a headless browser purpose-built for stealth — Reader handles TLS fingerprinting, Cloudflare challenge detection, navigator spoofing, and WebRTC masking so scrapers don’t get blocked before they even reach the content.

The library exposes three composable primitives. scrape() converts one or more URLs into clean markdown or HTML, with automatic main-content extraction, noise removal (navigation, cookie banners, footers), and batch concurrency control. crawl() performs breadth-first (BFS) link discovery from a starting URL with configurable depth and page limits, optionally scraping every discovered page. browser() launches a stealthed Chrome instance accessible over the Chrome DevTools Protocol (CDP), letting Playwright or Puppeteer connect to a pre-configured stealth context with no stealth-specific code changes.
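The breadth-first discovery with depth and page limits described for crawl() can be illustrated with a minimal sketch. This is not Reader's implementation: the in-memory LinkGraph, the CrawlLimits shape, and the bfsDiscover function are hypothetical names chosen for illustration, and a real crawler would fetch pages and extract links instead of reading them from a map.

```typescript
// Hypothetical in-memory link graph standing in for real page fetches.
type LinkGraph = Record<string, string[]>;

// Assumed option names; the real library's options may differ.
interface CrawlLimits {
  maxDepth: number; // how many link hops from the start URL to follow
  maxPages: number; // hard cap on the number of URLs returned
}

// Breadth-first discovery: every page at depth d is visited before any page at depth d + 1.
function bfsDiscover(start: string, graph: LinkGraph, limits: CrawlLimits): string[] {
  const seen = new Set<string>([start]);
  const order: string[] = [];
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];

  while (queue.length > 0 && order.length < limits.maxPages) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= limits.maxDepth) continue; // do not expand links past the depth limit
    for (const next of graph[url] ?? []) {
      if (!seen.has(next)) {
        seen.add(next); // mark on enqueue so a URL is never queued twice
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}

const site: LinkGraph = {
  "/": ["/docs", "/blog"],
  "/docs": ["/docs/api", "/"],
  "/blog": ["/blog/post-1"],
  "/docs/api": [],
  "/blog/post-1": [],
};

// With a depth limit of 1, only the start page and its direct links are returned.
console.log(bfsDiscover("/", site, { maxDepth: 1, maxPages: 10 }));
```

Marking URLs as seen when they are enqueued, not when they are dequeued, keeps cycles such as "/docs" linking back to "/" from inflating the queue.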

Under the hood, Reader manages a tiered browser pool keyed by proxy URL, with automatic proxy health tracking (circuit breaker, cooldown, revival), per-domain configuration profiles, robots.txt compliance, rate limiting, and a long-running daemon mode that keeps the pool warm between CLI commands. The markdown conversion uses a custom Rust-based library (supermarkdown) via NAPI for high-performance, panic-safe HTML-to-markdown transformation.
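The circuit breaker, cooldown, and revival cycle mentioned above can be sketched as a small state tracker. This is an illustrative sketch only: the ProxyHealth class, its threshold and cooldown parameters, and the choice to count consecutive failures are all assumptions, not Reader's actual health tracker. The clock is passed in explicitly so the behavior is easy to follow.

```typescript
type ProxyState = "healthy" | "cooling";

// Hypothetical per-proxy health tracker (not Reader's actual class).
class ProxyHealth {
  private failures = 0;
  private cooldownUntil = 0;
  private readonly failureThreshold: number; // assumed: consecutive failures before tripping
  private readonly cooldownMs: number; // assumed: how long a tripped proxy rests

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // A success resets the consecutive-failure counter.
  recordSuccess(): void {
    this.failures = 0;
  }

  // Enough consecutive failures trip the breaker and start the cooldown.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.cooldownUntil = now + this.cooldownMs;
      this.failures = 0;
    }
  }

  // The proxy is revived automatically once the cooldown has elapsed.
  state(now: number): ProxyState {
    return now < this.cooldownUntil ? "cooling" : "healthy";
  }
}

const health = new ProxyHealth(3, 30_000);
health.recordFailure(0);
health.recordFailure(1_000);
health.recordFailure(2_000); // third consecutive failure trips the breaker

console.log(health.state(2_500)); // "cooling"
console.log(health.state(32_000)); // "healthy" again after the 30 s cooldown
```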

The project ships with an npm package (@vakra-dev/reader), a companion cloud offering (@vakra-dev/reader-js) for those who prefer managed infrastructure, and a comprehensive CLI with scrape, crawl, browser, start, stop, and status subcommands. It targets Node.js 18+ and is licensed under Apache 2.0.

What You Get

ReaderClient API — A single TypeScript class with scrape(), crawl(), and browser() methods that share a managed browser pool and initialize lazily on first use.
Anti-bot stealth — TLS fingerprinting matched to Chrome, navigator property spoofing (webdriver=false), WebRTC masking, and automatic Cloudflare challenge detection and waiting baked into every request.
Tiered proxy pool — Configure datacenter and residential proxy pools separately; Reader auto-escalates from datacenter to residential when a site blocks the cheaper tier, with per-proxy circuit breaking and cooldown.
Clean markdown output — Rust-based supermarkdown converts HTML to LLM-ready markdown with configurable CSS selector inclusion/exclusion and automatic removal of nav bars, cookie banners, and footers.
BFS website crawler — Discover all pages from a root URL with configurable depth and page limits, with optional inline scraping of every discovered URL in a single call.
CDP browser sessions — Launch a stealthed Chrome and get a WebSocket endpoint that any Playwright or Puppeteer script can connect to via connectOverCDP(), with no stealth configuration needed.
Daemon mode — Run reader start to keep browser pools warm between CLI invocations; subsequent scrape and crawl commands auto-connect to the daemon, eliminating cold-start latency.
Full CLI — npx reader scrape, crawl, browser, start, stop, and status subcommands with options for concurrency, output format, proxy, timeout, CSS selectors, and verbose logging.
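To make the tiered proxy pool idea from the list above concrete, here is a minimal sketch of datacenter-to-residential escalation. It does not use Reader's API: the TierFetcher signature, the looksBlocked heuristic (treating HTTP 403 and 429 as blocks), and fetchWithEscalation are hypothetical, and real block detection would likely inspect page content as well.

```typescript
type Tier = "datacenter" | "residential";

interface FetchResult {
  status: number;
  body: string;
}

// Hypothetical fetcher signature: the caller decides how a URL is fetched through a tier.
type TierFetcher = (url: string, tier: Tier) => Promise<FetchResult>;

// Assumed block heuristic for illustration only.
function looksBlocked(result: FetchResult): boolean {
  return result.status === 403 || result.status === 429;
}

// Try the cheaper tier first and escalate only when the response looks blocked.
async function fetchWithEscalation(
  url: string,
  fetcher: TierFetcher,
): Promise<{ tier: Tier; result: FetchResult }> {
  const tiers: Tier[] = ["datacenter", "residential"];
  let last: FetchResult | undefined;
  for (const tier of tiers) {
    last = await fetcher(url, tier);
    if (!looksBlocked(last)) return { tier, result: last };
  }
  throw new Error(`All proxy tiers blocked for ${url} (last status ${last?.status})`);
}

// Fake fetcher for demonstration: the datacenter tier is blocked, residential succeeds.
const fakeFetcher: TierFetcher = async (_url, tier) =>
  tier === "datacenter" ? { status: 403, body: "" } : { status: 200, body: "<h1>ok</h1>" };

fetchWithEscalation("https://example.com", fakeFetcher).then(({ tier }) => {
  console.log(`served via ${tier} tier`); // served via residential tier
});
```

Ordering the tiers from cheapest to most expensive means the residential pool is only used for sites that actually need it.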

Common Use Cases

AI agent web access — Feed Reader’s markdown output directly into an LLM or RAG pipeline so agents can retrieve up-to-date information from any website without hitting API paywalls.
Documentation ingestion — Crawl an entire docs site at a given depth and scrape every page to build a searchable knowledge base or fine-tuning dataset.
Competitive monitoring — Batch-scrape competitor pricing pages, product listings, or news sites on a schedule using the daemon and the concurrency controls.
Anti-bot bypass automation — Drive sites that block Playwright and Puppeteer by connecting them to a Reader browser session whose stealth fingerprint is indistinguishable from a real Chrome user.
Data extraction pipelines — Integrate Reader into a TypeScript data pipeline to turn unstructured web content into structured markdown for downstream processing.
Proxy health testing — Use the built-in proxy verification and health tracker to validate proxy pool health before and during large-scale scraping runs.

Under The Hood

Architecture

Reader follows a layered, modular design organized around three public primitives (scrape, crawl, browser) that each delegate to a shared TieredBrowserPool managed by a ReaderClient lifecycle object. The Scraper class handles per-URL retry logic (datacenter proxy first, automatic escalation to residential on block detection, hard deadline enforcement), while the Crawler drives BFS link discovery and delegates individual page fetches back to the scraper layer. A dedicated EngineOrchestrator sits between the scraper and the Hero browser, applying domain profiles, running block-detection checks after each fetch, and choosing whether to retry with a higher-cost proxy tier. Browser sessions follow a separate path: each session gets its own dedicated Hero process rather than sharing the pool, giving callers complete isolation for long-lived interactive automation. The daemon process wraps the client in a persistent HTTP server so the pool survives across CLI invocations.
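Hard deadline enforcement around a fetch is commonly built by racing the work against a timer. The sketch below shows that general pattern and is not taken from Reader's source; withDeadline and the slow demo task are hypothetical names.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
async function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`deadline of ${ms} ms exceeded`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaving a stray timer once the race is decided
  }
}

// Hypothetical slow task used only for demonstration.
const slow = new Promise<string>((resolve) => setTimeout(() => resolve("done"), 200));

withDeadline(slow, 50).catch((err: Error) => {
  console.log(err.message); // deadline of 50 ms exceeded
});
```

Because the race only abandons the wait, a real scraper would also need to close the page or browser tab that is still running, which this sketch leaves out.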

Tech Stack

The project is written in TypeScript 5 running on Node.js 18+, compiled to ESM with tsup and Rollup. The core browser engine is Ulixee Hero 2.0 alpha, a Chromium-based headless browser specifically engineered for TLS fingerprint matching and anti-bot stealth, with Hero Core managing the shared browser runtime. HTML-to-markdown conversion uses the project’s own @vakra-dev/supermarkdown Rust NAPI module, providing panic-safe, high-performance conversion. Content parsing uses linkedom for DOM manipulation without a full browser context. Concurrency is controlled with p-limit; logging uses Pino with pino-pretty; the CLI is built on Commander.js. The test suite runs under Vitest; linting uses typescript-eslint and Prettier.

Code Quality

The codebase has extensive unit test coverage with over 20 test files spanning block detection, content cleaning, crawler logic, daemon dispatch, domain profiles, error types, health tracking, markdown formatting, metadata extraction, proxy configuration, robots parsing, scraper retry logic, tiered pool management, and URL utilities. Test files use Vitest’s globals mode with a 30-second timeout for browser-dependent tests. Error handling is explicit and typed: a structured error hierarchy exports named classes (NetworkError, CloudflareError, BotDetectedError, ProxyExhaustedError, etc.) that the scraper catches selectively to distinguish retryable failures from non-retryable ones like DNS errors or robots.txt blocks. The codebase uses TypeScript strict mode, consistent JSDoc comments on all public methods, and ESLint plus Prettier for consistent formatting.
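A typed error hierarchy of this kind can be sketched as follows. The class names NetworkError, CloudflareError, and BotDetectedError come from the description above, but everything else is an assumption: these are local stand-ins rather than the library's exports, the ScrapeError base class and RobotsBlockedError are hypothetical, and which errors count as retryable is chosen here purely for illustration.

```typescript
// Hypothetical base class carrying a retryable flag.
class ScrapeError extends Error {
  readonly retryable: boolean;

  constructor(message: string, retryable: boolean) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.retryable = retryable;
  }
}

// Stand-ins named after classes mentioned in the text (retryable by assumption).
class NetworkError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}
class CloudflareError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}
class BotDetectedError extends ScrapeError {
  constructor(message: string) {
    super(message, true);
  }
}

// Hypothetical non-retryable error for robots.txt disallow rules.
class RobotsBlockedError extends ScrapeError {
  constructor(message: string) {
    super(message, false);
  }
}

// Only known, retryable scrape errors trigger another attempt.
function shouldRetry(err: unknown): boolean {
  return err instanceof ScrapeError && err.retryable;
}

console.log(shouldRetry(new CloudflareError("challenge page returned"))); // true
console.log(shouldRetry(new RobotsBlockedError("path disallowed"))); // false
console.log(shouldRetry(new Error("unexpected"))); // false
```

Treating unknown errors as non-retryable is a conservative default: a retry loop then only repeats work for failures it has been told are transient.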

What Makes It Unique

Reader’s most distinctive technical choice is its foundation on Ulixee Hero rather than raw Playwright or Puppeteer. Hero’s MITM proxy layer injects real Chrome TLS fingerprints at the network level, not at the browser JS level, making it dramatically harder for anti-bot systems to detect automation via JA3/JA4 fingerprinting or TCP/IP stack analysis. The companion supermarkdown Rust library adds another layer of differentiation: rather than using a pure-JS HTML-to-markdown converter, it wraps a Rust implementation with NAPI bindings and catch_unwind panic safety, so pathological HTML inputs that would crash a JS converter are handled gracefully with a fallback text extraction path. The tiered proxy pool with automatic escalation is also uncommon in open source scrapers: most libraries make the caller manage proxy selection per request, while Reader models it as a pool-level concern with health-circuit-breaking baked in.

Self-Hosting

Reader is licensed under the Apache License 2.0, which is a permissive open source license. You can use it commercially, modify it, distribute it, and sublicense it without requiring your own application to be open-sourced. The only obligations are to preserve copyright and license notices and to indicate changes you made. There are no copyleft or viral provisions, so embedding Reader in a proprietary product or SaaS offering is straightforward.

Running Reader yourself requires a Node.js 18+ environment with Chrome or Chromium available. The package bundles a Chrome 139 binary for x86_64 Linux and macOS; Apple Silicon users must point to a system Chrome. For production deployments the browser pool consumes meaningful RAM: each Hero browser instance holds a Chromium process with its own memory space, so a pool of five browsers, each recycled after 50 pages, can require several gigabytes of available memory on the host. There is no built-in horizontal scaling or distributed coordination. The TieredBrowserPool is single-process, so scaling beyond what one Node.js process can manage requires running multiple Reader instances behind your own work-distribution layer. The daemon mode keeps the pool warm between requests but is still single-host.

The companion cloud offering at app.reader.dev (@vakra-dev/reader-js) abstracts away all of the infrastructure: no browser process to manage, no proxy pool to configure, no memory pressure to monitor. The cloud tier adds managed proxy rotation, SLA-backed uptime, automatic scaling, and a hosted API key workflow. If your use case involves unpredictable burst traffic, very large crawls, or you prefer not to manage browser infrastructure, the managed tier is the pragmatic choice. The self-hosted path gives you full data sovereignty, no per-request pricing, and the ability to run air-gapped or within a private network.
