Morphik
Morphik is an AI-native ingestion and retrieval engine that lets developers store, search, and reason over visually rich documents — scanned PDFs, manuals, slides, and video — without duct-taping together OCR, an embedding model, and a vector database.
Morphik is an AI-native ingestion and retrieval platform built for documents that don’t reduce cleanly to plain text — scanned PDFs, technical manuals with diagrams, slide decks, spreadsheets, and video. Instead of assembling a pipeline out of a separate OCR tool, embedding model, and vector database, Morphik bundles ingestion, multimodal embedding, storage, and retrieval into one FastAPI-based service that developers call through a Python SDK, a REST API, or the Model Context Protocol.
Its ingestion pipeline (core/services/ingestion_service.py, core/parser/morphik_parser.py) parses files with Docling and PyMuPDF, then represents pages using the ColPali technique — patch-level image embeddings that capture the layout of a chart, table, or diagram directly, rather than flattening it into lossy OCR text. Rules-based metadata extraction runs alongside ingestion to pull out bounding boxes, labels, and classifications. Everything is stored in Postgres with pgvector, using dedicated multi-vector store implementations (core/vector_store/multi_vector_store.py, fast_multivector_store.py) to handle the many-vectors-per-page shape that ColPali produces, and heavy ingestion work is offloaded to Redis-backed background workers (arq) so the API stays responsive.
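ColPali-style retrieval is usually scored with late interaction (often called MaxSim): each query vector is matched against its best patch vector on a page, and those best matches are summed. The sketch below shows that scoring rule on toy 2-dimensional vectors in plain Python. It is an illustration of the general technique, not Morphik's implementation; the function names, the page IDs, and the assumption that Morphik's multi-vector stores score pages exactly this way are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: best-matching patch per query vector, summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page and return (page_id, score) pairs, best first."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two query vectors, two pages with several patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "chart_page": [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]],
    "text_page": [[0.2, 0.1], [0.3, 0.2]],
}
ranking = rank_pages(query, pages)
print(ranking[0][0])  # the page whose patches best cover both query vectors
```

Because every page carries many vectors instead of one, the store has to hold and compare a list of vectors per page, which is the shape the dedicated multi-vector stores are described as handling.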
A custom Rust extension (morphik_rust, built with PyO3) handles the CPU-bound parts of the pipeline — base64 encoding, text chunking with overlap, and binary quantization / Hamming-distance search over embeddings — to keep large ingestion batches fast without leaving Python. LiteLLM sits behind the embedding and completion layers, so teams can point Morphik at OpenAI, Anthropic, Google, or a local Ollama model without changing application code.
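Text chunking with overlap, one of the tasks attributed to the Rust extension above, can be sketched in a few lines of Python. This is a minimal character-based version under stated assumptions: the function name, the parameters, and the choice to split on characters rather than tokens or sentences are hypothetical, and the real morphik_rust code may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one here; a larger overlap keeps more shared context at chunk boundaries at the cost of more chunks to embed.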
Morphik can be self-hosted via Docker Compose (Postgres/pgvector, Redis, the API, and an ingestion worker) or run as a managed Morphik Cloud service with a free tier. The core codebase ships under the Business Source License 1.1, a source-available license rather than an OSI-approved open source one, and a separate, stricter Enterprise license covers the Google Drive, GitHub, and Zotero connectors; see the Self-Hosting section below for exactly what that means in practice.
What You Get
- Multimodal ingestion pipeline - Docling- and PyMuPDF-backed parsing for PDFs, Office docs, images, and video that preserves page layout instead of collapsing it into plain text.
- ColPali visual embeddings - patch-level image embeddings of each page so charts, scanned tables, and diagrams are searchable without lossy OCR.
- Rules-based metadata extraction - configurable extraction of bounding boxes, labels, and document classifications during ingestion.
- Postgres + pgvector storage with multiple vector-store strategies - dedicated multi-vector and fast multi-vector store implementations built for ColPali’s many-vectors-per-page output.
- Redis/arq background ingestion workers - large document batches are processed off the request path so the API stays responsive.
- Python SDK, REST API, and MCP support - three first-class integration paths for wiring Morphik into agents and applications.
Common Use Cases
- Searching technical documentation with diagrams - engineering and manufacturing teams query PDF manuals and spec sheets and get answers grounded in the actual diagram or table, not a garbled OCR dump.
- Building RAG features without a bespoke pipeline - AI product teams self-host Morphik instead of stitching together an OCR tool, an embedding model, and a vector database themselves.
- Centralizing knowledge from external sources - teams with a Morphik Enterprise subscription use the Google Drive, GitHub, and Zotero connectors in ee/ to pull outside content into one searchable store.
- Contract and compliance document review - legal and compliance teams ingest large batches of contracts and policies and use rules-based metadata extraction to flag key clauses and classifications.
- Searching visually dense research papers - research teams query papers full of figures and tables and retrieve answers tied to the specific chart or table, not just surrounding text.
Under The Hood
Architecture
Morphik follows a layered service architecture: FastAPI routers (core/routes/documents.py, ingest.py, folders.py, v2.py) sit atop a service layer (core/services/document_service.py, ingestion_service.py) that orchestrates parsing, embedding, and storage, with core/app_factory.py wiring startup/shutdown lifecycle and core/dependencies.py providing shared resources like the Redis pool. Ingestion is decoupled from the request path through core/workers/ingestion_worker.py, an arq-based worker pulling jobs off a Redis queue, so large uploads don’t block the API. Storage, embedding, parsing, and reranking each sit behind their own base-class abstraction (core/vector_store/base_vector_store.py, core/embedding/base_embedding_model.py, core/parser/base_parser.py, core/reranker/base_reranker.py), with concrete pgvector, ColPali, Docling, and FlagEmbedding implementations plugged in through configuration (core/config.py, morphik.toml) rather than hardcoded, which gives a genuine provider-swap seam. The service classes themselves are extensive, so a change to the core Document/DocumentResult models would ripple through nearly every vector store, parser, and route module that depends on them.
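The base-class-plus-configuration pattern described above can be sketched as follows. The class name BaseEmbeddingModel echoes the file name core/embedding/base_embedding_model.py, but the method signature, the two toy implementations, the registry, and the "embedding_provider" config key are all hypothetical and are not taken from Morphik's code or from morphik.toml.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class BaseEmbeddingModel(ABC):
    """Interface every embedding provider implements (signature is an assumption)."""

    @abstractmethod
    def embed(self, text: str) -> List[float]:
        ...


class LengthEmbedding(BaseEmbeddingModel):
    """Toy provider: a one-dimensional 'embedding' equal to the text length."""

    def embed(self, text: str) -> List[float]:
        return [float(len(text))]


class VowelEmbedding(BaseEmbeddingModel):
    """Toy provider: a one-dimensional 'embedding' equal to the vowel count."""

    def embed(self, text: str) -> List[float]:
        return [float(sum(ch in "aeiou" for ch in text.lower()))]


# Hypothetical registry mapping a config value to a provider factory.
REGISTRY: Dict[str, Callable[[], BaseEmbeddingModel]] = {
    "length": LengthEmbedding,
    "vowel": VowelEmbedding,
}


def build_embedding_model(config: Dict[str, str]) -> BaseEmbeddingModel:
    """Pick the concrete provider from configuration instead of hardcoding it."""
    provider = config["embedding_provider"]
    try:
        factory = REGISTRY[provider]
    except KeyError:
        raise ValueError(f"unknown embedding provider: {provider!r}") from None
    return factory()


model = build_embedding_model({"embedding_provider": "vowel"})
vector = model.embed("Morphik")
print(vector)  # [2.0]
```

Callers depend only on the abstract interface, so swapping providers is a configuration change rather than a code change, which is the seam the paragraph above describes.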
Tech Stack
The backend is Python 3.10+ on FastAPI, with SQLAlchemy, asyncpg, and psycopg handling Postgres access and pgvector storing embeddings directly in Postgres rather than a separate vector database. Background ingestion runs on arq over Redis. LiteLLM abstracts completions and embeddings across OpenAI, Anthropic, Google, and Ollama; ColPali (via the illuin-tech colpali-engine) and FlagEmbedding provide multimodal and text embeddings respectively, with Docling, PyMuPDF, pdf2image, and WeasyPrint handling document parsing and rendering. A dedicated Rust crate (morphik_rust), compiled to a Python extension with PyO3, offloads base64, chunking, and binary-quantization work from the Python hot path. Deployment is Docker Compose based, with services for the API, an ingestion worker, Postgres/pgvector, Redis, and an optional local Ollama container; OpenTelemetry and Sentry integrations are wired in for observability.
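The effect of running ingestion on a background worker can be shown without Redis. The sketch below swaps in the standard library's asyncio.Queue as a stand-in for the arq/Redis queue, so it is not arq code and does not reflect Morphik's worker; the function names, status strings, and document IDs are hypothetical. It only demonstrates the pattern: the submitting side records a job and returns at once, and a separate worker finishes the job later.

```python
import asyncio
from typing import Dict, Tuple


async def ingestion_worker(queue: asyncio.Queue, status: Dict[str, str]) -> None:
    """Pull document IDs off the queue and mark them done (stand-in for real parsing)."""
    while True:
        doc_id = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for parsing and embedding work
            status[doc_id] = "completed"
        finally:
            queue.task_done()


def submit(queue: asyncio.Queue, status: Dict[str, str], doc_id: str) -> str:
    """Request-path side: record the job and return immediately."""
    status[doc_id] = "processing"
    queue.put_nowait(doc_id)
    return doc_id


async def main() -> Tuple[Dict[str, str], Dict[str, str]]:
    queue: asyncio.Queue = asyncio.Queue()
    status: Dict[str, str] = {}
    worker = asyncio.create_task(ingestion_worker(queue, status))

    for doc_id in ("manual.pdf", "slides.pptx"):
        submit(queue, status, doc_id)
    before = dict(status)  # snapshot taken before the worker has had a chance to run

    await queue.join()  # wait until every queued job is processed
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return before, status


before, after = asyncio.run(main())
print(before)
print(after)
```

With a real queue backed by Redis, the worker would live in a separate process, which is what lets large uploads proceed without tying up the API.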
Code Quality
There is a real pytest suite: unit tests cover metadata filter SQL generation, multi-vector storage, ColPali embedding and rendering, video parsing, reranking, and request-model validation, plus an integration test with fixture PDFs. Pre-commit hooks enforce isort, black, and ruff formatting. Type hints and Pydantic models, including Literal-typed settings in core/config.py, are used consistently, and custom exception types such as InvalidMetadataFilterError replace bare excepts at least in the database layer. The gap is CI enforcement: the GitHub Actions workflows only build the Docker image and run a secret-leak scan on pull requests. None of them execute the test suite or linters, so passing tests is a local/pre-commit convention rather than a merge gate.
What Makes It Unique
Morphik’s distinguishing choice is treating document pages as images first: ColPali-style patch embeddings let it search scanned tables, charts, and diagrams by their visual layout instead of running OCR and hoping the extracted text survives intact, a direct answer to the visually-rich-documents-break-RAG problem the project calls out. It pairs that with rules-based structured-metadata extraction at ingestion time and a hand-written Rust extension for binary quantization and Hamming-distance search, which is an unusual amount of low-level optimization for a project this size to bring in-house rather than lean on an off-the-shelf vector database’s ANN index. None of these techniques were invented by Morphik (ColPali is published research and binary quantization is a known technique), but combining them into one ingestion-to-retrieval pipeline, rather than requiring users to assemble it themselves, is a genuine product bet that most vector-DB-plus-OCR RAG stacks don’t make.
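Binary quantization and Hamming-distance search, mentioned above, can be illustrated in plain Python. This is a minimal sketch of the general technique, assuming sign-based quantization (one bit per dimension, set when the value is positive); the function names and toy vectors are hypothetical, and the Rust extension's actual thresholding and bit layout are not described in this document.

```python
from typing import Dict, List, Sequence


def binarize(vector: Sequence[float]) -> int:
    """Pack one sign bit per dimension into an int: bit i is 1 when vector[i] > 0."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")


def nearest(query: Sequence[float], corpus: Dict[str, Sequence[float]]) -> List[str]:
    """Return corpus keys ordered from smallest to largest Hamming distance."""
    q = binarize(query)
    return sorted(corpus, key=lambda key: hamming(q, binarize(corpus[key])))


# Toy 4-dimensional embeddings (hypothetical values).
query = [0.4, -0.2, 0.9, -0.7]
corpus = {
    "a": [0.1, -0.5, 0.3, -0.1],   # same sign pattern as the query
    "b": [-0.3, 0.8, 0.2, -0.4],   # differs in two positions
    "c": [-0.9, 0.6, -0.1, 0.5],   # differs in all four positions
}
order = nearest(query, corpus)
print(order)  # ['a', 'b', 'c']
```

Each float shrinks to a single bit and the comparison becomes an XOR plus a bit count, which is why the approach is attractive for fast first-pass search over large numbers of vectors.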
Self-Hosting
Licensing Model
Morphik Core ships under the Business Source License 1.1 (BUSL-1.1), a source-available license with usage restrictions, not an OSI-approved open source license. The license text itself states plainly that it “is not an Open Source license,” though it converts to Apache License 2.0 on a stated future date.
Self-Hosting Restrictions
- Production use of the core codebase is free only while your Morphik deployment’s attributable gross revenue stays under $2,000/month (the license’s “Additional Use Grant”); above that threshold you must purchase a commercial key from Morphik.
- The ee/ directory (Google Drive, GitHub, and Zotero connectors, plus the EE UI components) ships under a separate, stricter “Morphik Enterprise License.” It requires an active Morphik subscription / agreement to Morphik’s Subscription Terms of Service for ANY production use, regardless of revenue; the $2,000/month grant does not apply to this code. You may copy and modify it for development and testing without a subscription, but not run it in production.
- Redistribution, sublicensing, or resale of the Licensed Work itself is not permitted under BUSL 1.1.
Enterprise Features
- Connectors for pulling in external content sources (Google Drive, GitHub, and Zotero) live in ee/services/connectors/ and are gated behind the Enterprise license described above.
- The README points to a hosted Morphik Cloud offering with a free tier and usage-based pricing for larger workloads; exact pricing tiers weren’t verified beyond what’s stated in the README and LICENSE.
Cloud vs Self-Hosted
The maintainers state directly in the README that self-hosted deployments get limited support (“due to limited resources, we cannot provide full support for self-hosted deployments”), while Morphik Cloud is positioned as the recommended, fully supported path.
License Key Required
Yes, in two situations verified from the repo: (1) once a self-hosted deployment’s revenue crosses $2,000/month under the core BUSL grant, a commercial key must be purchased, and (2) any production use of the ee/ connector code requires a Morphik Enterprise subscription regardless of revenue.
Future Open Source
Per the LICENSE file, the Change Date is 2029-06-18 (or four years after a given version’s first public release, whichever comes first), at which point that version of the core Licensed Work automatically re-licenses to Apache License 2.0.
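The "whichever comes first" rule above is simple date arithmetic. The sketch below computes the effective Change Date for a given version as the earlier of 2029-06-18 and four years after that version's first public release. The release dates used are hypothetical examples, not actual Morphik release dates, and the LICENSE file remains the authority on how the rule applies.

```python
from datetime import date

FIXED_CHANGE_DATE = date(2029, 6, 18)  # Change Date stated in the LICENSE file


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def change_date(first_public_release: date) -> date:
    """Earlier of the fixed Change Date and four years after first public release."""
    return min(FIXED_CHANGE_DATE, add_years(first_public_release, 4))


# Hypothetical release dates, for illustration only.
early_release = change_date(date(2025, 1, 10))
late_release = change_date(date(2026, 3, 1))
print(early_release)  # 2029-01-10
print(late_release)   # 2029-06-18
```

A version released early enough converts before the fixed date; anything released later is capped by 2029-06-18.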
Related Apps
- Ollama (MIT; AI Development · Developer Tools) - Run Llama, Gemma, DeepSeek, and other open LLMs on your own machine with one command and an OpenAI-compatible API.
- Langflow (MIT; AI Agents · AI Development) - Build, test, and deploy AI agents and RAG workflows visually with native API and MCP server export.
- Dify (No Code Platforms · AI Development · Developer Tools) - Visual LLM workflow platform with RAG pipelines, agent capabilities, and model management for building production AI applications.