Paperless-ngx is a community-maintained document management system that transforms physical papers into a searchable digital archive. It’s designed for individuals and small teams who want to eliminate paper clutter while keeping full control over their sensitive documents. It suits home users, freelancers, and small businesses handling invoices, tax records, or contracts.
Built with Django and Angular, it uses Tesseract OCR to extract text from scanned PDFs and images, and machine learning models to classify the results. The system supports Docker-based deployment, integrates with document scanners via watch folders, and offers a web-based UI for browsing, tagging, and retrieving documents. It can be self-hosted on any Linux server with Docker, so documents stay on hardware you control and remain available offline.
What You Get
- OCR-powered document indexing - Uses Tesseract OCR to extract text from scanned PDFs, images, and faxes, enabling full-text search across all uploaded documents.
- AI-powered document classification - Automatically assigns document types (e.g., invoice, receipt) using machine learning models trained on document content and structure.
- Tagging and metadata system - Allows manual and automatic tagging of documents with custom tags, dates, and document types for granular organization.
- Web-based document browser - Provides a responsive Angular UI to view, sort, and filter documents by tags, dates, or content with thumbnail previews.
- Watch folder automation - Monitors local directories for new files and automatically processes them via OCR and classification without manual upload.
- Docker-based deployment - Official Docker Compose setup simplifies installation on any Linux system and avoids dependency conflicts and system-wide installs.
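The watch folder behaviour described above can be illustrated with a small polling sketch. This is not Paperless-ngx's own consumer code: the function name `find_new_files`, the `SUPPORTED_SUFFIXES` set, and the "remember file names already seen" approach are all assumptions made for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Hypothetical list of file types to pick up; not taken from Paperless-ngx.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def find_new_files(watch_dir: Path, seen: set[str]) -> list[Path]:
    """Return supported files in watch_dir that were not seen before.

    Names of returned files are added to `seen`, so a second call with the
    same set only reports files that appeared in the meantime.
    """
    new_files = []
    for path in sorted(watch_dir.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip file types we do not want to process
        if path.name in seen:
            continue  # already handed off in an earlier poll
        seen.add(path.name)
        new_files.append(path)
    return new_files
```

A caller would invoke `find_new_files` on a timer and pass each returned path to an OCR and classification step. A production consumer would more likely rely on filesystem notifications and wait until a file has finished being written.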
Common Use Cases
- Managing personal tax documents - A homeowner scans and archives receipts, W-2s, and bank statements to easily retrieve them during tax season without physical filing.
- Running a small accounting firm - An accountant ingests client invoices and statements via a scanner, tags them by client, and retrieves them instantly using search or filters.
- Archiving legal or medical records - A paralegal or healthcare provider digitizes paper files with OCR to support compliance requirements and enable keyword-based retrieval without cloud storage risks.
- Organizing business receipts for expense tracking - A freelancer uses a mobile scanner to capture receipts, which Paperless-ngx auto-classifies and tags for export to accounting software.
Under The Hood
Architecture
- Monolithic Django backend that integrates document management, search, and AI components in a single service; there are no clear service boundaries, but the data storage, indexing, and processing layers are kept separate
- Search subsystem bypasses Django ORM for performance using a custom Tantivy pipeline with tightly coupled schema and query components, sacrificing modularity for speed
- Dependency injection is absent, with AI and search clients hard-coded to specific backends and configured via environment variables
- Frontend and backend are cleanly decoupled via REST APIs, enabling independent development and deployment cycles
- Authentication is modularized through Django-allauth and custom DRF-compatible auth classes with seamless OpenAPI integration
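Because the frontend talks to the backend only through REST APIs, any HTTP client can do the same. The sketch below builds a full-text search request without sending it. The `/api/documents/` path, the `query` parameter, and the `Token` authorization scheme are assumptions to confirm against the project's API documentation, and `build_search_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url: str, token: str, query: str) -> Request:
    """Build (but do not send) a document search request.

    Assumed, not confirmed here: the endpoint path, the `query` parameter
    name, and DRF-style token authentication.
    """
    params = urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/documents/?{params}"
    headers = {
        "Authorization": f"Token {token}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a running instance.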
Tech Stack
- Python 3.11–3.14 backend powered by Django 5.2.10, Channels, and DRF, backed by PostgreSQL or MariaDB
- Angular frontend built with TypeScript, managed via pnpm and Node.js 24, with environment-aware build configurations
- Redis serves as the primary caching and task broker, with Celery and Flower handling background processing and monitoring
- Containerized deployment uses multi-stage Docker builds, s6-overlay, the uv package manager, and Granian with uvloop for high-performance serving
- Comprehensive AI/ML stack including FAISS, LlamaIndex, Sentence Transformers, Ollama, OpenAI, and Tika for semantic search and document analysis
Code Quality
- Extensive test coverage across both backend and frontend with pytest, Jasmine, and Karma validating core functionality and UI behavior
- Clear layering with Django models and management commands handling persistence, while Angular services manage state and UI interactions
- Robust error handling in frontend services using RxJS observables with graceful degradation and user-friendly notifications
- Consistent naming conventions and strong type safety enforced via TypeScript in the frontend and Python type hints in the backend
- Comprehensive linting, testing, and quality pipelines integrated into the build process for both Python and TypeScript codebases
What Makes It Unique
- Extensible plugin system for date parsing via Python entry points allows community-driven support for custom document naming and multilingual formats
- Dynamic workflow engine using Django ORM annotations enables complex, conditionally triggered document automation with minimal database overhead
- Native Tesseract OCR integration with configurable pixel limits and RTL language support provides self-contained, offline text extraction
- Client-side PDF editor with lazy-loaded rendering and drag-drop page reordering delivers a native-like editing experience in the browser
- Unified configuration UI with real-time sync between environment variables and config files reduces deployment misconfigurations
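The entry-point mechanism behind the date parsing plugins can be sketched with the standard library alone. The group name `example_dms.date_parsers`, the helper names, and the "parser returns a date or None" contract are hypothetical; the actual group and interface used by Paperless-ngx are not stated here.

```python
from __future__ import annotations

from datetime import date
from importlib.metadata import entry_points
from typing import Callable, Optional

# Hypothetical entry point group; the real group name may differ.
DATE_PARSER_GROUP = "example_dms.date_parsers"

DateParser = Callable[[str], Optional[date]]


def discover_plugins(group: str) -> dict[str, DateParser]:
    """Load every installed entry point registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        selected = eps.select(group=group)  # Python 3.10+
    else:
        selected = eps.get(group, [])  # older dict-style return value
    return {ep.name: ep.load() for ep in selected}


def iso_parser(text: str) -> Optional[date]:
    """Built-in fallback: accept only ISO dates such as 2024-03-01."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


def parse_with_plugins(text: str, parsers: dict[str, DateParser]) -> Optional[date]:
    """Try each parser in turn and return the first date found."""
    for parser in parsers.values():
        result = parser(text)
        if result is not None:
            return result
    return None
```

A third-party package would register its parser under the chosen group in its packaging metadata, and `discover_plugins(DATE_PARSER_GROUP)` would then pick it up with no change to the host application.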