ArchiveBox is a self-hosted application designed to preserve web content by capturing snapshots of URLs in multiple standard formats. It targets individuals, researchers, journalists, and legal professionals who need to retain control over their digital archives, avoiding reliance on third-party services that may disappear or change policies. By using tools like Chrome, wget, and yt-dlp, ArchiveBox extracts not just pages but embedded media, code repositories, and social media content.
Built in Python with a Docker-first design, ArchiveBox offers a CLI, a web UI, a Python API, and a REST API for flexible integration. Data is stored in plain files (HTML, JSON, PNG, WARC, SQLite) that remain accessible without proprietary software. It imports from Pocket, Pinboard, RSS, browser history, and GitHub/GitLab, and can automatically submit URLs to the Wayback Machine for redundancy.
What You Get
- Multi-format archiving - Saves URLs as HTML+CSS+JS, SingleFile HTML, PNG screenshots, PDFs, WARC archives, TXT article text, and metadata, all in open, non-proprietary formats.
- YouTube/SoundCloud media extraction - Automatically downloads MP3/MP4 files, subtitles, thumbnails, and metadata from video and audio links using yt-dlp.
- GitHub/GitLab repository archiving - Clones entire repositories, including READMEs and images, preserving source code and documentation.
- Scheduled archiving - Automates imports from RSS feeds, Pocket, Pinboard, browser history, and bookmarks via cron jobs or webhooks.
- Browser extension integration - One-click archiving from Chrome with the official ArchiveBox Exporter extension that sends URLs directly to your self-hosted instance.
- REST API & Python API - Programmatically add, query, and manage archives using a REST API (ALPHA) and Python library for custom automation and integrations.
- Wayback Machine backup - Submits each archived URL to archive.org by default, so an off-site copy exists without extra configuration.
- Docker & CLI-first deployment - Install via Docker Compose, plain Docker, pip, or apt; all methods use the same data format and support identical features.
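The import and CLI features above can be combined in a small script. The following is a minimal sketch, not ArchiveBox's own importer: it assumes the `archivebox` executable is on PATH, that the data directory was already set up with `archivebox init`, and that the feed is plain RSS 2.0. The helper names `links_from_rss`, `build_add_command`, and `archive_urls` are hypothetical.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def links_from_rss(feed_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def build_add_command(urls, depth=0):
    """Build the argv for `archivebox add` (illustrative helper)."""
    return ["archivebox", "add", f"--depth={depth}", *urls]


def archive_urls(urls, data_dir, depth=0):
    """Run `archivebox add` inside data_dir and return its exit code."""
    if shutil.which("archivebox") is None:
        raise RuntimeError("archivebox executable not found on PATH")
    completed = subprocess.run(build_add_command(urls, depth), cwd=data_dir)
    return completed.returncode
```

A cron job could call `archive_urls(links_from_rss(feed_text), "/path/to/data")` on whatever interval suits the feed; the data directory path here is a placeholder.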
Common Use Cases
- Preserving legal evidence - Lawyers archive web pages containing contracts, social media posts, or forum discussions as timestamped PDFs and WARC files that can support court submissions.
- Journalists saving research sources - Reporters use ArchiveBox to capture cited articles, tweets, and videos before they’re deleted or paywalled, ensuring verifiable sourcing.
- Researchers building training datasets - Academics crawl social media and news sites to collect text, images, and metadata for LLM training or trend analysis.
- Personal bookmark backup - Users sync their Pocket/Pinboard collections to ArchiveBox to prevent link rot and retain full page snapshots without relying on third-party services.
Under The Hood
Architecture
- Django-based monolithic structure with tightly coupled models, views, and admin interfaces, lacking clear service or repository layers
- Configuration managed via environment variables and files without dependency injection, leading to implicit dependencies
- CLI and API layers share core logic, blurring boundaries between user-facing endpoints and background archiving tasks
- Plugin-like extensibility via external tools (Chromium, yt-dlp) is implemented through shell execution rather than formal APIs, limiting modularity
- Absence of interfaces or strategy patterns makes replacing archiving backends or search engines difficult without deep code changes
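To make the last point concrete, here is a rough sketch of what a strategy-style extractor interface could look like. It is not taken from the ArchiveBox codebase: `Extractor`, `ExtractResult`, the two example classes, and `run_extractors` are all hypothetical names, and the extractors only fabricate output paths instead of launching real tools.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class ExtractResult:
    extractor: str
    ok: bool
    output: Optional[str] = None


class Extractor(Protocol):
    """Anything with a name and an extract() method counts as a backend."""
    name: str

    def extract(self, url: str, out_dir: str) -> ExtractResult: ...


class ScreenshotExtractor:
    name = "screenshot"

    def extract(self, url, out_dir):
        # A real backend would drive a headless browser here.
        return ExtractResult(self.name, True, f"{out_dir}/screenshot.png")


class MediaExtractor:
    name = "media"

    def extract(self, url, out_dir):
        # Stand-in rule: pretend only URLs containing "video" carry media.
        ok = "video" in url
        return ExtractResult(self.name, ok, f"{out_dir}/media" if ok else None)


def run_extractors(url, out_dir, extractors):
    """Call each backend in order; swapping one means changing only the list."""
    return [e.extract(url, out_dir) for e in extractors]
```

With an interface of this shape, replacing a backend would mean passing a different object to `run_extractors` rather than editing the calling code.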
Tech Stack
- Python 3.13 backend powered by Django 6.0, enhanced with Django-Ninja for APIs and Django Extensions for dev tooling
- ASGI server Daphne with Supervisor for process management, ensuring production-grade stability
- Pydantic and benedict enable structured configuration and dynamic dictionary access across the codebase
- Headless browser and media capture rely on bundled Chromium, yt-dlp, and single-file within Docker
- Sonic provides built-in full-text search, configured via environment variables for indexed snapshot retrieval
- Pre-commit hooks enforce modern Python standards using Ruff, pyupgrade, and yesqa
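As an illustration of environment-driven configuration for the search backend, the sketch below reads settings into a typed object. It uses a standard-library dataclass in place of Pydantic to stay dependency-free. The variable names and defaults are assumptions modelled on ArchiveBox's search settings and should be checked against the documentation for the installed version; `SearchConfig` and `load_search_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    engine: str
    host: str
    port: int


def load_search_config(env=None):
    """Build a SearchConfig from a mapping of environment variables."""
    env = os.environ if env is None else env
    return SearchConfig(
        # Variable names and fallback values below are assumptions.
        engine=env.get("SEARCH_BACKEND_ENGINE", "ripgrep"),
        host=env.get("SEARCH_BACKEND_HOST_NAME", "localhost"),
        port=int(env.get("SEARCH_BACKEND_PORT", "1491")),
    )
```

Passing the environment in as a plain mapping keeps the loader easy to exercise in tests without touching the real process environment.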
Code Quality
- Extensive test coverage spans CLI, admin UI, migrations, and edge cases with direct SQLite and subprocess validation
- Models, admin interfaces, and execution code are kept in separate modules, with consistent Django ORM usage
- Migration tests verify schema evolution and data integrity across versions
- Error handling prioritizes observable system behavior through subprocess exit codes and output validation
- Clear naming conventions and reusable fixtures improve test readability and maintainability
- mypy-compatible type annotations cover models, APIs, and test utilities
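The testing style described above, running a command as a subprocess and then inspecting SQLite directly, can be sketched as follows. This is an illustration of the pattern rather than a test from the project: the inline script stands in for a real CLI, and the `snapshot` table and the function names are hypothetical.

```python
import sqlite3
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a CLI under test: a tiny script that writes one row to SQLite.
SCRIPT = """
import sqlite3, sys
db = sqlite3.connect(sys.argv[1])
db.execute("CREATE TABLE IF NOT EXISTS snapshot (url TEXT)")
db.execute("INSERT INTO snapshot (url) VALUES (?)", (sys.argv[2],))
db.commit()
db.close()
print("added", sys.argv[2])
"""


def run_cli(db_path, url):
    """Run the stand-in command in a child process and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", SCRIPT, str(db_path), url],
        capture_output=True,
        text=True,
    )


def check_add_records_snapshot():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "index.sqlite3"
        proc = run_cli(db_path, "https://example.com")

        # Observable behaviour first: exit code and printed output.
        assert proc.returncode == 0, proc.stderr
        assert "added" in proc.stdout

        # Then verify the side effect by querying the database directly.
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("SELECT url FROM snapshot").fetchall()
        finally:
            conn.close()
        assert rows == [("https://example.com",)]
```

Checking the exit code and the database together catches the case where a command reports success but writes nothing.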
What Makes It Unique
- Event-driven workflows via abx_dl enable plugin-based crawling without modifying core logic
- Raw SQL migrations ensure deterministic, version-controlled database state across diverse environments
- Signal-based webhooks trigger real-time external automation on archival events, eliminating polling
- UUIDv7-based primary keys combine time-ordering with random bits, so IDs sort chronologically and are hard to enumerate when exposed through the API
- Self-contained platform unifies browser automation, metadata extraction, and archiving into a single interface with no reliance on external services for core functionality
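For readers unfamiliar with UUIDv7, the sketch below builds one by hand following the RFC 9562 bit layout: a 48-bit millisecond timestamp, a 4-bit version, 12 random bits, a 2-bit variant, and 62 more random bits. It is an illustration only and is not how ArchiveBox generates its keys; the function name `uuid7` and its `timestamp_ms` parameter are choices made for this example.

```python
import os
import time
import uuid


def uuid7(timestamp_ms=None):
    """Build a version 7 UUID from a millisecond timestamp plus random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000

    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits

    value = (timestamp_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                              # bits 79..76: version 7
    value |= rand_a << 64                           # bits 75..64: random
    value |= 0b10 << 62                             # bits 63..62: variant
    value |= rand_b                                 # bits 61..0: random
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, IDs created in different milliseconds compare in creation order, which keeps index inserts roughly sequential. The random bits make individual IDs hard to guess, but an ID of this kind identifies a record and is not a substitute for an authentication token.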