knowhere

Name: knowhere
Rating: 5 (1854 reviews)

Transform messy, unstructured documents into persistent, navigable memory that AI agents can actually use.

1.9K stars

226 forks

Apache License 2.0

Python

Knowhere is an open-source document memory infrastructure stack that sits between raw files and AI agents. Instead of dumping flat text at an LLM, Knowhere ingests PDFs, Office files, images, and Markdown, then reconstructs their full hierarchical structure using an in-house tree-building algorithm, preserving headings, sections, tables, and cross-document relationships as a navigable memory graph.

The platform runs in two stages: build and retrieve. During the build phase, documents are routed to specialized parsers (defaulting to MinerU for PDFs), then a multi-pass document agent using a ReAct-style loop analyzes page anatomy, detects table-of-contents structures, assigns heading levels, and organizes chunks with full section-path context. During retrieval, a hybrid engine fuses keyword (BM25), path, semantic, and vector channels using Reciprocal Rank Fusion, then an LLM-driven navigation agent walks the section tree to drill into the most relevant regions.
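The fusion step in the retrieve stage can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not Knowhere's implementation: the function name, the channel names, the chunk IDs, and the constant k=60 (a value commonly used for RRF) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    rankings: dict mapping a channel name to chunk IDs, best first.
    k: smoothing constant (60 is a common default, not a confirmed
       Knowhere setting).
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            # Each channel contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical per-channel results for one query.
channels = {
    "bm25": ["c3", "c1", "c7"],
    "path": ["c1", "c4"],
    "semantic": ["c1", "c3", "c9"],
}
fused = reciprocal_rank_fusion(channels)
```

Because RRF only looks at ranks, channels with incomparable score scales (lexical scores versus cosine similarities, for example) can be combined without normalization.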

Knowhere exposes its retrieval engine as an MCP (Model Context Protocol) server, making it natively compatible with Claude, Cursor, and other agentic tool frameworks. Every result carries traceable source paths (document, section, chunk, and linked assets), so downstream agents can cite evidence rather than hallucinate. According to the project's internal benchmarks, this yields +36% first-try accuracy and +11% recall compared with feeding raw documents directly to agents.
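For readers unfamiliar with MCP, the sketch below builds the JSON-RPC 2.0 message an MCP client sends to invoke a tool. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification, and the tool name `retrieval.query` is the one listed under What You Get. The `query` argument name and the helper function are assumptions, and a real client would first complete the MCP initialization handshake, usually through an SDK rather than by hand.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request in the shape MCP defines."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    request_id=1,
    tool_name="retrieval.query",
    # Hypothetical argument name; the tool's published input schema is authoritative.
    arguments={"query": "What is the warranty period?"},
)
payload = json.dumps(request)
```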

The full stack is self-hostable via Docker Compose, with official Python and Node.js SDKs available for cloud API access. A companion dashboard, worker service, and shared infrastructure package ship as separate repositories in the Ontos-AI ecosystem, all orchestrated by a uv workspace.

What You Get

A FastAPI-based backend that auto-migrates its PostgreSQL schema on startup and exposes OpenAPI docs at /docs
A Celery/gevent worker service that processes ingestion jobs with per-job state gates, billing controls, and source preparation
An in-house tree-like heading hierarchy algorithm that reconstructs document structure from flat parser output rather than leaving it flattened
A ReAct-style document profile agent with token budget envelopes per stage (TOC confirm, coarse planner, structural react, page locate, page tagging)
A hybrid retrieval engine fusing BM25 lexical, path, content, and semantic channels via Reciprocal Rank Fusion (RRF)
An LLM-driven agentic retrieval orchestrator that navigates section trees with EXPAND/BACK/SEARCH_IMAGES/SEARCH_TABLES/FINISH actions
A built-in MCP (Model Context Protocol) server at /mcp exposing a retrieval.query tool for direct agent integration
Official Python and Node.js SDKs plus a self-hosted Docker Compose stack with LocalStack, PostgreSQL, and Redis
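The navigation actions listed above suggest an observe-act loop over the section tree. The following toy sketch shows the general shape of such a loop under heavy assumptions: the tree, the `navigate` function, and the keyword-matching policy that stands in for the LLM are all hypothetical, and the two search actions are declared but not modeled.

```python
from enum import Enum


class Action(Enum):
    EXPAND = "EXPAND"
    BACK = "BACK"
    SEARCH_IMAGES = "SEARCH_IMAGES"
    SEARCH_TABLES = "SEARCH_TABLES"
    FINISH = "FINISH"


# Toy section tree: title -> child sections (hypothetical data).
TREE = {
    "Manual": {
        "Installation": {"Requirements": {}, "Steps": {}},
        "Warranty": {"Coverage": {}, "Claims": {}},
    }
}


def keyword_policy(path, children, query):
    """Stand-in for the LLM: expand the first child whose title is a query word."""
    words = set(query.lower().split())
    for title in children:
        if title.lower() in words:
            return Action.EXPAND, title
    return Action.FINISH, None


def navigate(tree, query, policy, max_steps=10):
    """Walk the tree with an observe-act loop and return the final section path."""
    path = [next(iter(tree))]  # start at the root section
    for _ in range(max_steps):  # simple step budget
        node = tree
        for title in path:
            node = node[title]
        action, target = policy(path, list(node), query)
        if action is Action.EXPAND:
            path.append(target)
        elif action is Action.BACK and len(path) > 1:
            path.pop()
        else:
            # FINISH, BACK at the root, or an action this sketch does not model.
            break
    return path


result = navigate(TREE, "warranty claims process", keyword_policy)
```

The `max_steps` cap plays the role of a budget: whatever the policy decides, the walk terminates after a bounded number of observations.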

Common Use Cases

Building RAG pipelines over technical PDFs, legal contracts, or research papers where flat chunking destroys cross-section context
Connecting Claude, Cursor, or other MCP-compatible agents to a private document knowledge base without a vector-only retrieval hack
Processing large document collections (300-500+ page atlases, technical drawings) with layout-aware parsing and atlas classification
Multi-modal document Q&A where tables, charts, and embedded images need to be described, linked, and cited alongside text chunks
Enterprise knowledge management where evidence traceability (document → section → chunk → asset) is required for compliance
Self-hosted AI data pipelines that need full control over where documents are stored and how retrieval models are configured

Under The Hood

Architecture

Knowhere follows a distributed, service-oriented architecture split across three deployable units sharing a common Python package. The apps/api FastAPI service handles all HTTP routing, authentication, rate limiting, billing, and webhooks, and exposes the MCP retrieval endpoint; it runs Alembic migrations on startup and warms a PostgreSQL async connection pool via asyncpg. The apps/worker Celery/gevent service owns all CPU-bound document processing (ingestion orchestration, parsing, structural analysis, and the document profile agent) and is monkey-patched with gevent for cooperative scheduling. The packages/shared-python package contains all shared models, database sessions, Redis clients, retrieval services, storage adapters, and chunk structures. The retrieval path lives entirely in the shared package and is called from both the API and the MCP server. Separation of concerns is enforced structurally: routes are thin adapters over typed workflow outcomes (documented in ADR-0001 and ADR-0002), the worker applies a per-job state gate and billing guard before any compute runs, and the agentic retrieval orchestrator is policy-explicit, with configurable budget envelopes per stage (ADR-0003).
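The idea of checking a state gate and a billing guard before any compute runs can be sketched as follows. Every name here (`JobState`, `Job`, `run_job`, the cost and balance fields) is hypothetical and chosen only to illustrate the ordering of the checks; it is not taken from the Knowhere codebase.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    REJECTED = "rejected"


@dataclass
class Job:
    job_id: str
    state: JobState
    estimated_cost: float


def run_job(job, balance, process):
    """Run process(job) only if the state gate and billing guard both pass."""
    # State gate: only queued jobs may start, so a redelivered task for a
    # finished or already-running job does no work.
    if job.state is not JobState.QUEUED:
        return job.state
    # Billing guard: refuse before any compute is spent.
    if balance < job.estimated_cost:
        job.state = JobState.REJECTED
        return job.state
    job.state = JobState.RUNNING
    process(job)
    job.state = JobState.DONE
    return job.state


processed = []
ok = run_job(Job("a", JobState.QUEUED, 1.0), balance=5.0, process=processed.append)
broke = run_job(Job("b", JobState.QUEUED, 9.0), balance=5.0, process=processed.append)
```

Returning a typed state rather than raising keeps the outcome explicit for the caller, which is in the spirit of the typed workflow outcomes described above.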

Tech Stack

The project uses Python 3.11+ throughout, managed as a uv workspace with three packages. The API is built on FastAPI 0.135 + Uvicorn 0.34 + SQLAlchemy 2.0 (async), backed by PostgreSQL with pgvector for vector storage and Redis 5 for caching, rate limiting, and Celery task queuing, with celery-redbeat as the Redis-backed scheduler for periodic tasks. Document parsing leans on MinerU as the default PDF backend, with python-docx, python-pptx, pypdf, pymupdf, pptx2md, openpyxl, and markitdown for other formats; pandas handles the intermediate dataframe representation of chunk/heading data. The document profile agent uses the OpenAI SDK (model-agnostic via environment variables, with support for DeepSeek, Qwen-VL, GPT, Zhipu, and Volcengine). The MCP server is built on FastMCP from the official mcp package (1.27+). Observability is wired through Logfire with optional PostHog telemetry for self-hosted deployments, and Stripe handles cloud billing. Type checking uses Pyright in basic mode; linting uses Ruff.

Code Quality

The codebase has extensive contract test coverage: 43 test files split across apps/api/tests/contract/ and apps/worker/tests/contract/, covering agentic discovery, retrieval, billing, API key auth, document lifecycle, worker bootstrap, parse task execution, and more. Tests use pytest with pytest-asyncio, pytest-alembic, fakeredis, and pytest-postgresql. Three Architecture Decision Records document structural invariants. The codebase uses Pydantic v2 for all data validation and typed workflow outcome enums throughout the worker pipeline. Gevent monkey-patching at the worker entry point is explicitly documented and isolated. Error handling is explicit, with loguru structured logging at every stage gate. Some areas lack unit tests (retrieval channel scoring, heading tree logic), but the contract test surface is broad.

What Makes It Unique

Knowhere’s primary technical differentiator is that it treats document hierarchy as a first-class data structure rather than an afterthought. The in-house tree-building algorithm reconstructs heading levels and section paths from raw parser output using a stack-based parent-child traversal, then propagates these paths into every chunk’s metadata, so a chunk in “Chapter 3 / Section 2.1 / Subsection a” carries its full lineage. The agentic retrieval orchestrator then navigates this tree with an LLM observe-act loop (using actions such as EXPAND, BACK, and FINISH), simulating how a human reader drills into relevant sections rather than trusting flat cosine similarity. The MCP server integration is first-class and uses stateless HTTP rather than being a bolt-on, making Knowhere natively consumable by Claude Code, Cursor, and any MCP-capable agent without a custom integration layer.
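A stack-based traversal of this kind can be sketched in a few lines. This is a generic illustration of the technique, not Knowhere's algorithm: the input tuple format, the function name, and the `section_path` key are assumptions, and the sketch takes heading levels as given, whereas the project describes inferring them with a document agent.

```python
def assign_section_paths(items):
    """Attach a section path to each chunk using a stack of open headings.

    items: ("heading", level, title) or ("chunk", text) tuples in reading order.
    """
    stack = []  # (level, title) for each heading that is currently open
    chunks = []
    for item in items:
        if item[0] == "heading":
            _, level, title = item
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        else:
            _, text = item
            path = " / ".join(title for _, title in stack)
            chunks.append({"text": text, "section_path": path})
    return chunks


# Hypothetical flat parser output.
parsed = [
    ("heading", 1, "Chapter 3"),
    ("heading", 2, "Section 2.1"),
    ("heading", 3, "Subsection a"),
    ("chunk", "Torque limits are listed below."),
    ("heading", 2, "Section 2.2"),
    ("chunk", "Maintenance intervals."),
]
chunks = assign_section_paths(parsed)
```

Each chunk ends up carrying its full lineage as a string, so a retriever can match on the path itself as well as on the chunk text.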

Self-Hosting

Licensing Model: Apache 2.0 licensed. All features are available in self-hosted deployments, with no restrictions or license keys required.

Self-Hosting: The full stack (API, worker, dashboard) ships as a Docker Compose configuration in the separate knowhere-self-hosted repository. No feature is gated behind a cloud plan in the self-hosted path.

Cloud vs Self-Hosted: Knowhere Cloud at knowhereto.ai offers a managed API with $5 free credits on registration. The cloud offering provides the same capabilities as self-hosted; the difference is operational (managed infrastructure vs. bring your own).
