PrivateGPT is a locally run, open-source AI system that lets users query their own documents—PDFs, CSVs, DOCX, PPTX, and more—using LLMs without any internet connection. It solves the critical privacy concern in enterprise AI adoption by ensuring sensitive data never leaves the user’s environment, making it ideal for healthcare, legal, finance, and government sectors. Built with FastAPI and LlamaIndex, it provides a production-grade RAG pipeline with OpenAI API compatibility and a Gradio UI for immediate use.
The system is architected with dependency injection and modular components, allowing developers to swap LLM backends (such as LlamaCPP), embedding backends (such as SentenceTransformers), and vector stores (Qdrant, Chroma) without changing core logic. It supports both high-level RAG abstractions and low-level API primitives, enabling rapid prototyping and enterprise-grade customization. Deployment options include Docker, local Python environments, and on-premise servers.
What You Get
- OpenAI API-compatible endpoint - Exposes /v1/chat/completions and /v1/embeddings endpoints that mirror OpenAI’s API, allowing seamless replacement of cloud-based LLMs with local models without code changes.
- Document ingestion pipeline - Automatically parses and processes PDFs, DOCX, PPTX, TXT, MD, CSV, and HTML files, extracting text, metadata, and splitting content into context-aware chunks for RAG.
- Local LLM support via LlamaCPP - Runs quantized LLMs (e.g., Llama 2, Mistral) directly on your machine using llama.cpp, eliminating cloud dependencies and enabling offline inference.
- Qdrant & Chroma vector stores - Built-in support for local vector databases to store and retrieve document embeddings, with Qdrant as the default for high-performance similarity search.
- Gradio-based UI - Provides an interactive web interface to chat with your documents, view sources, and test queries without writing any code.
- Low-level RAG primitives API - Exposes direct access to embedding generation, chunk retrieval, and LLM inference for building custom pipelines beyond standard RAG workflows.
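Because the chat endpoint mirrors OpenAI's schema, a request can be built with nothing but the standard library. A minimal sketch, assuming a server on PrivateGPT's default `http://localhost:8001`; the `use_context` flag is PrivateGPT's extension for grounding answers in previously ingested documents (field names other than the standard `messages`/`stream` should be checked against your installed version):

```python
import json
import urllib.request

# Default local PrivateGPT address; adjust if your deployment differs.
BASE_URL = "http://localhost:8001/v1"

def build_chat_request(question: str, use_context: bool = True) -> dict:
    """Build an OpenAI-style chat completions payload.

    `use_context` is a PrivateGPT extension that switches RAG on,
    grounding the answer in the ingested document store.
    """
    return {
        "messages": [{"role": "user", "content": question}],
        "use_context": use_context,
        "stream": False,
    }

def ask(question: str) -> str:
    """POST to the local server; requires PrivateGPT to be running."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Same response shape as OpenAI: choices -> message -> content.
    return body["choices"][0]["message"]["content"]

payload = build_chat_request("What does the contract say about termination?")
```

Pointing an existing OpenAI client library at `BASE_URL` works the same way, which is what makes the drop-in replacement possible.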
Common Use Cases
- Running a confidential legal case analysis - A law firm ingests case documents and contracts into PrivateGPT to query them internally without exposing sensitive client data to third-party AI providers.
- Building a secure internal knowledge base - A healthcare provider uses PrivateGPT to let staff ask questions about patient protocols, policies, and research papers without violating HIPAA by sending data externally.
- Deploying AI in air-gapped environments - A defense contractor runs PrivateGPT on a disconnected server to analyze classified reports and technical manuals without any network connectivity.
- Replacing OpenAI API in enterprise apps - A financial institution migrates from OpenAI to PrivateGPT to comply with data residency laws, using the same API structure to avoid rewriting client code.
Under The Hood
Architecture
- Clear separation of concerns through plugin-style, configuration-driven component selection, enabling runtime swapping of LLM and embedding implementations without code modifications
- Dependency injection via centralized service registries and configuration files that dynamically instantiate backends like LlamaCPP, Ollama, or Azure OpenAI
- Modular ingestion pipeline with abstract base classes and concrete implementations for batch and parallelized data loading, ensuring consistent interfaces across strategies
- Decoupled infrastructure layers where external services (Qdrant, Ollama, Traefik) are managed independently via Docker Compose, with application logic interacting solely through configurable endpoints
- Prompt style abstraction using model-specific formatting classes that encapsulate tokenization rules, eliminating hard-coded templates and improving cross-model compatibility
- Clean API layer with controllers delegating all business logic to service components, adhering strictly to clean architecture principles
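The prompt-style abstraction in the points above can be sketched as a small class hierarchy selected by configuration. The class and registry names here are illustrative simplifications, not the project's exact API:

```python
from abc import ABC, abstractmethod

class AbstractPromptStyle(ABC):
    """Encapsulates one model family's chat formatting rules."""

    @abstractmethod
    def messages_to_prompt(self, messages: list[dict]) -> str: ...

class Llama2PromptStyle(AbstractPromptStyle):
    """Llama 2's [INST] ... [/INST] convention, simplified."""

    def messages_to_prompt(self, messages: list[dict]) -> str:
        parts = []
        for m in messages:
            if m["role"] == "user":
                parts.append(f"<s>[INST] {m['content']} [/INST]")
            else:
                parts.append(f" {m['content']} </s>")
        return "".join(parts)

class ChatMLPromptStyle(AbstractPromptStyle):
    """ChatML-style tags used by many instruct fine-tunes."""

    def messages_to_prompt(self, messages: list[dict]) -> str:
        return "".join(
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
            for m in messages
        )

# Config-driven selection: the settings file names a style, the
# registry instantiates it. No template strings in core logic.
STYLES = {"llama2": Llama2PromptStyle, "chatml": ChatMLPromptStyle}

def get_prompt_style(name: str) -> AbstractPromptStyle:
    return STYLES[name]()

prompt = get_prompt_style("llama2").messages_to_prompt(
    [{"role": "user", "content": "Hello"}]
)
```

Swapping the configured style name changes the wire format sent to the model without touching any calling code, which is the point of the abstraction.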
Tech Stack
- Python backend powered by FastAPI and Uvicorn, with Pydantic for robust data validation and Injector for dependency injection
- Pluggable LLM and embedding backends via LlamaIndex, supporting local, cloud, and open-source models with uniform interfaces
- Vector storage managed by Qdrant with optional PostgreSQL integration for persistent document metadata
- Environment-aware configuration using YAML files with environment variable interpolation for seamless local, Docker, and cloud deployments
- Dockerized deployment with multi-stage builds for CPU/GPU optimization, Traefik routing, and Ollama orchestration
- Comprehensive tooling with Poetry for dependency management, Ruff for linting, Black for formatting, Mypy for type safety, and pytest for testing
Code Quality
- Extensive test coverage across core modules including prompt formatting, embedding generation, ingestion pipelines, and API endpoints
- Robust integration testing using FastAPI TestClient to validate HTTP behavior, response structures, and authentication flows with realistic payloads
- Well-organized test structure mirroring application layers, enabling focused development and maintainability
- Comprehensive error handling with explicit assertions for invalid inputs, misconfigurations, and access violations
- Strong type safety enforced through Pydantic models for all API inputs and outputs, ensuring serialization integrity
- Consistent naming and modular design that aligns test organization with system architecture, improving discoverability and extensibility
What Makes It Unique
- Native reranking integration using cross-encoders within the RAG pipeline, enabling dynamic document reordering before LLM inference—a rare feature in open-source RAG systems
- Pluggable LLM backends with dependency-isolated implementations that allow seamless switching between local and cloud models without code changes
- Automated tokenizer downloading and caching tied to Hugging Face authentication, enabling private, offline-capable deployments without manual model management
- Dual API layer design that exposes both high-level RAG abstractions and low-level component access, empowering users and developers alike
- Dynamic prompt style adaptation that automatically applies model-specific formatting rules, eliminating brittle template hardcoding and improving cross-model compatibility
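The reranking step amounts to re-scoring the retrieved chunks with a model that reads query and chunk jointly, then keeping the best few before they reach the LLM. A sketch of that control flow, using a toy lexical scorer as a stand-in for a real cross-encoder (in practice something like sentence-transformers' `CrossEncoder.predict` would fill the `score` slot):

```python
from typing import Callable

def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score (query, chunk) pairs and keep the best top_n.

    `score` stands in for a cross-encoder, which reads query and
    chunk together rather than comparing precomputed embeddings.
    """
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer (token overlap); demonstrates the flow, not quality."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

docs = [
    "Termination requires 30 days written notice.",
    "The office kitchen is cleaned on Fridays.",
    "Notice of termination must be written and delivered in person.",
]
top = rerank("written notice of termination", docs, overlap_score, top_n=2)
```

Vector search fetches a generous candidate set cheaply; the (more expensive) pairwise rerank then orders those candidates far more accurately, which is why the two-stage design pays off.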