ArchiveBox is an open-source, self-hosted web archiving tool designed to preserve digital content by capturing snapshots of web pages in multiple durable formats. It addresses the fragility of online information by saving not just the HTML, but also associated media, articles, code repositories, and social media content. Unlike cloud-based archiving services like the Wayback Machine, ArchiveBox gives users full control over their data—no third-party dependency. It’s ideal for individuals managing personal bookmarks, journalists preserving evidence, researchers building datasets, and organizations needing long-term digital preservation without relying on external platforms.
Built with Python and leveraging standard tools like Chrome, wget, yt-dlp, and singlefile, ArchiveBox stores all data in plain files (HTML, JSON, PNG, PDF, WARC) that remain accessible without proprietary software. It supports both CLI and web interface workflows, integrates with Pocket, Pinboard, RSS feeds, browser extensions, and can schedule automated archiving. The project emphasizes transparency, durability, and user ownership of archived data.
What You Get
- Multi-format archiving - Saves each URL as HTML, PNG screenshot, PDF, TXT article text, WARC, JSON metadata, favicon, and HTTP headers—all stored as standard files readable without ArchiveBox.
- Scheduled & automated imports - Automatically archive content from bookmarks, browser history, Pocket, Pinboard, RSS feeds, and YouTube/SoundCloud links via cron jobs or webhooks.
- Docker & CLI support - Install and run ArchiveBox via Docker Compose, plain Docker, pip, or system package managers with full feature parity across all methods.
- Self-hosted web UI - Access a fully functional web interface to browse, search, tag, and manage your archive with filters, tags, and detailed snapshot views.
- REST API & Python API - Programmatically add URLs, query archives, and integrate with other tools using the ALPHA REST API or BETA Python library.
- Support for authenticated content - Archive paywalled or login-protected pages using custom Chrome user data directories with saved sessions and cookies.
- Redundancy to archive.org - By default, every archived page is also submitted to the Internet Archive for offsite backup (configurable via settings).
- Cross-platform storage - Store archives on NFS, SMB, S3, Backblaze B2, or Google Drive via configurable storage backends.
Common Use Cases
- Building a personal knowledge base - A researcher saves 500+ academic papers and blog posts from bookmarks, automatically extracting text, PDFs, and metadata for offline reference.
- Legal evidence preservation - A lawyer archives web pages containing defamatory content or contractual terms, saving full HTML, screenshots, and WARC files for court admissibility.
- Journalist fact-checking workflow - A reporter uses ArchiveBox to capture live web content during investigations, ensuring cited sources remain accessible even after the original page is removed.
- DevOps teams managing digital heritage - Teams automate archiving of critical documentation, internal wikis, and public APIs to ensure continuity during migrations or vendor lock-in risks.
- Social media content backup - A user exports all their Instagram and YouTube favorites into MP4, thumbnails, and metadata files to avoid losing them when platforms change policies.
- Problem → Solution flow - The user loses access to a critical blog post after the domain expires → ArchiveBox saved it as HTML + PDF + screenshot weeks earlier, preserving all content and layout.
- Team/workflow scenario - A 10-person newsroom uses ArchiveBox to centralize all article references, tag them by topic, and export WARC files for compliance audits—all self-hosted behind their firewall.
Under The Hood
ArchiveBox is a Python-based web archiving platform built on Django, designed for self-hosted preservation of internet content. It combines CLI flexibility with a Django-powered admin interface to provide a unified solution for capturing and managing web archives.
Architecture
ArchiveBox adopts a monolithic yet modular architecture centered around Django, enabling clear separation of concerns across functional domains.
- The system is structured using Django apps to encapsulate distinct responsibilities such as CLI, API, and core archiving logic
- It follows layered design principles with defined boundaries between presentation, business logic, and data access layers
- Key design patterns include middleware for authentication and CORS handling, as well as strategy-based API token validation
- Component communication is managed through Django’s built-in systems, API endpoints, and model-driven data persistence
Tech Stack
ArchiveBox leverages Python and Django as its core foundation, integrating with a variety of system and third-party tools for content capture.
- Built with Python 3.13+ and Django 6.0+, using Django Ninja for API handling and Daphne as an ASGI server
- Relies on yt-dlp, Playwright, and Chromium for media capture, supported by Sonic search backend and Node.js extractors
- Development tools include uv for dependency management, PDM as the build system, and Ruff with MyPy for linting and type checking
- Testing is carried out using pytest, Django’s test runner, and coverage analysis with a wide range of test scenarios
Code Quality
ArchiveBox presents a mature Python codebase with consistent structure and extensible design, though some legacy patterns persist.
- The project maintains a reasonable level of test coverage with diverse test functions covering core operations like adding and updating archive entries
- Error handling is consistently applied throughout the codebase, though some generic exception handling suggests areas for refinement
- Code organization is generally consistent with adherence to Python conventions and clear module separation
- Technical debt indicators such as deprecated patterns or legacy migrations point to ongoing evolution and maintenance efforts
What Makes It Unique
ArchiveBox distinguishes itself through its plugin-driven architecture and hybrid CLI-admin approach, offering flexibility in web archiving workflows.
- Plugin architecture enables developers to extend core functionality without modifying base code, supporting custom crawlers and processing logic
- A built-in Django admin interface provides a visual dashboard for archive management while CLI tools support automation and scripting
- Unified abstraction layer allows integration of multiple archiving tools like wget and yt-dlp for flexible content capture strategies
- Rich developer tooling with structured data models and detailed API endpoints facilitates integration and workflow automation