cocoindex

Name: cocoindex
Rating: 5 (10603 reviews)

An incremental data indexing engine that keeps AI agent context perpetually fresh by reprocessing only what changed.

10.6K stars

821 forks

Apache License 2.0

Rust

CocoIndex is an open-source framework that solves one of the most persistent problems in production AI applications: stale context. Traditional RAG pipelines re-index entire datasets on every run, wasting compute and introducing latency. CocoIndex flips this model by tracking exactly which source documents changed and recomputing only the affected delta — so your vector stores, knowledge graphs, and search indexes stay continuously synchronized with their upstream sources.
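The delta-tracking idea can be illustrated with a minimal sketch. This is not CocoIndex's API: `fingerprint` and `diff_sources` are hypothetical helpers, and the example assumes a SHA-256 hash of each document's content is an acceptable change signal.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable content hash for one document (hypothetical helper)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_sources(previous: dict[str, str], current_docs: dict[str, str]):
    """Compare fingerprints stored by the last run against current documents.

    previous maps doc id -> fingerprint from the last run.
    current_docs maps doc id -> current content.
    Returns (changed_or_new_ids, deleted_ids, new_fingerprints).
    """
    new_fps = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    changed = sorted(d for d, fp in new_fps.items() if previous.get(d) != fp)
    deleted = sorted(d for d in previous if d not in new_fps)
    return changed, deleted, new_fps


# First run: nothing is stored yet, so every document counts as new.
docs = {"a.md": "alpha", "b.md": "beta"}
changed, deleted, state = diff_sources({}, docs)

# Second run: b.md was edited and c.md was added; a.md is untouched.
docs = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}
changed2, deleted2, state = diff_sources(state, docs)
```

Only the ids returned in `changed2` would need to be re-chunked and re-embedded; `a.md` is skipped entirely.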

The framework ships a declarative Python API where you describe the desired state of your index (sources, transformations, and target stores) and CocoIndex’s incremental engine maintains that contract automatically. Functions marked with memo=True are content-addressed by hash, meaning identical inputs never get re-embedded or re-transformed, regardless of how many times you run the pipeline. This cache-first approach saves substantial compute: re-indexing runs typically see 80-90% cache hit rates, because most content is unchanged between runs.

Under the hood, CocoIndex’s core engine is written in Rust and exposed to Python via PyO3, giving you native performance for the heavy lifting while retaining the expressiveness of Python for pipeline definition. The connector ecosystem spans sources such as local filesystems, PostgreSQL, Google Drive, Amazon S3, and Kafka, graph targets such as Neo4j and FalkorDB, and vector stores including pgvector, Qdrant, LanceDB, and SurrealDB. A first-party MCP server called CocoIndex-code extends the framework into AI coding agents, giving tools like Claude Code and Cursor AST-aware, incrementally refreshed views of entire repositories.

CocoIndex targets teams building production AI systems that require reliable, low-latency context retrieval at scale. Whether you are building a semantic code search tool, a meeting-notes knowledge graph, a multi-modal image search pipeline, or a document Q&A system, CocoIndex handles the infrastructure of keeping derived indexes in sync so application logic can focus on reasoning rather than data plumbing.

What You Get

A declarative Python API for defining indexing pipelines against any combination of sources and target stores
Content-addressed memoization that skips re-embedding or re-transforming any input that has not changed since the last run
A high-performance Rust core exposed via PyO3 for low-latency incremental execution at scale
Pre-built connectors for local filesystems, PostgreSQL, Google Drive, Amazon S3, Kafka, and OCI Object Storage as sources
Pre-built target connectors for pgvector, Qdrant, LanceDB, SurrealDB, Turbopuffer, SQLite-vec, Neo4j, FalkorDB, and Apache Doris
Built-in text splitting and AST-aware chunking for code files across Python, TypeScript, Rust, and Go
A first-party CocoIndex-code MCP server that gives AI coding agents a live, incrementally refreshed semantic index of entire repositories
A live update server mode with filesystem watching for continuous background indexing during development
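To show what AST-aware chunking means in practice, here is a minimal sketch that splits Python source along top-level function and class boundaries using the standard-library `ast` module. It is an assumption-laden illustration, not the algorithm used by CocoIndex's chunker, and `chunk_python_source` is a hypothetical helper that handles Python only.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                {"name": node.name, "kind": type(node).__name__, "text": text}
            )
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        pass
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Compared with fixed-size splitting, each chunk here is a complete definition, so an edit inside `load` changes only that chunk's content and leaves the `Index` chunk unchanged.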

Common Use Cases

Building a semantic code search or code-intelligence layer for a repository, keeping embeddings updated on every commit without full re-indexing
Indexing a corporate knowledge base — Slack exports, meeting notes, Google Drive documents — into a vector store for an enterprise AI assistant
Constructing and maintaining a knowledge graph from unstructured documents using Neo4j or FalkorDB as the target store
Processing a continuously growing PDF or document corpus into a pgvector-backed RAG pipeline with cache-efficient incremental updates
Streaming Kafka topics into a vector store or graph database with exactly-once semantics and delta-only processing
Wiring AI coding agents such as Claude Code or Cursor to a live, AST-aware semantic index of a monorepo via the CocoIndex-code MCP server

Under The Hood

Architecture

CocoIndex follows a clean layered architecture that separates pipeline declaration, incremental execution, and storage persistence into distinct concerns. The declarative Python API is a thin facade over a Rust engine core where the actual work happens: the App type tracks a tree of Component nodes, each representing a pipeline stage that owns its own memoized state and tracks child invalidation independently. When a source document changes, invalidation signals propagate down the component tree and only the affected subtrees are re-executed, while unaffected branches return immediately from cache. State persistence uses LMDB (via the heed crate) as an embedded key-value store for memoized hashes and target state tracking; all LMDB-specific code is isolated behind the state_store module so the engine layer never touches storage primitives directly. This separation of declaration, incremental logic, and persistence makes the system easy to reason about and extend without cascading side effects.
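The invalidation behaviour described here can be sketched with a toy component tree. The `Component` class below is a hypothetical Python model written for illustration; it borrows the name from the description above but is not the Rust engine's type, and its `invalidate` and `run` methods are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Component:
    name: str
    compute: Callable[[], str]
    children: list["Component"] = field(default_factory=list)
    dirty: bool = True            # a new node has no cached result yet
    cached: Optional[str] = None

    def invalidate(self) -> None:
        """Mark this node and its whole subtree as needing re-execution."""
        self.dirty = True
        for child in self.children:
            child.invalidate()

    def run(self, log: list[str]) -> None:
        """Re-execute dirty nodes only; clean nodes keep their cached value."""
        if self.dirty:
            self.cached = self.compute()
            self.dirty = False
            log.append(self.name)
        for child in self.children:
            child.run(log)


def stage(label: str) -> Callable[[], str]:
    return lambda: f"{label}-output"


doc_a = Component("doc_a", stage("a"), [Component("doc_a/chunks", stage("a-chunks"))])
doc_b = Component("doc_b", stage("b"), [Component("doc_b/chunks", stage("b-chunks"))])
app = Component("app", stage("app"), [doc_a, doc_b])

first_run: list[str] = []
app.run(first_run)      # cold start: every node executes

doc_b.invalidate()      # simulate a change to the second source document
second_run: list[str] = []
app.run(second_run)     # only the doc_b subtree executes
```

After the simulated change, `second_run` records only `doc_b` and `doc_b/chunks`; the `app` node and the `doc_a` branch are served from their cached values.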

Tech Stack

The core incremental engine is written in Rust using Tokio for async execution and Rayon for CPU-bound parallelism, compiled as a native extension and exposed to Python via PyO3 and Maturin. The Python layer uses Click for its CLI, Rich for progress display, and msgspec for high-performance serialization of memo states. LMDB serves as the embedded state store for tracking what has been indexed across pipeline runs. The connector ecosystem includes asyncpg for PostgreSQL, aiobotocore for Amazon S3, confluent-kafka for Kafka consumption, neo4j and falkordb for graph targets, and qdrant-client, lancedb, and pgvector for vector stores. Text processing is handled by a dedicated Rust crate (ops_text) that implements AST-aware chunking for Python, TypeScript, Rust, and Go. The build system uses Maturin for Rust-Python wheel generation and uv for Python dependency management.

Code Quality

The codebase demonstrates a high standard of engineering discipline across both the Rust and Python layers. The Python package is fully typed with strict mypy enabled and a py.typed marker, and the Rust code is idiomatic with comprehensive use of anyhow, typed error propagation, and trait-based abstractions. The test suite is extensive: the python/tests/core/ directory alone contains over forty focused test modules covering incremental logic change detection, memo state validation, concurrency control, cancellation, provider generation, and typed serde round-trips, all backed by pytest-asyncio and testcontainers for database integration testing. CI runs on every push with separate workflows for fast linting, full integration tests, type checking, link validation, and release. Connector code follows a consistent pattern of optional dependency groups so users only pay for what they install.

What Makes It Unique

The genuinely novel contribution of CocoIndex is its content-addressed memoization model applied to AI data pipelines. While streaming ETL tools (Flink, Kafka Streams) process change feeds and batch ETL tools process full snapshots, CocoIndex occupies a different position: it runs in bounded time on any trigger by tracking input hashes rather than event streams, making it suitable for workflows where sources do not emit change events natively, such as a filesystem, a Google Drive folder, or a PostgreSQL table. The memo=True decorator semantics, where a function is guaranteed to be re-executed only when hash(input) or hash(code) changes, are a clean primitive that maps well to the embedding use case where model upgrades must force re-computation while unchanged documents stay cached. The CocoIndex-code MCP server further extends this into the AI coding agent space with AST-level granularity, a use case that competing general-purpose ETL frameworks do not address.
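The "re-execute only when hash(input) or hash(code) changes" rule can be sketched with a small decorator. This is a hypothetical `memo` written for illustration, not CocoIndex's decorator. It assumes the function's bytecode plus constants is a good enough proxy for "the code", and that pickling the arguments gives a usable input hash; neither assumption holds in general.

```python
import hashlib
import pickle
from functools import wraps

_store: dict[str, object] = {}


def _code_fingerprint(fn) -> str:
    """Hash bytecode and constants as a rough proxy for the function's code."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def memo(fn):
    """Cache results under a key built from hash(code) and hash(input)."""
    code_fp = _code_fingerprint(fn)

    @wraps(fn)
    def wrapper(*args):
        input_fp = hashlib.sha256(pickle.dumps(args)).hexdigest()
        key = f"{fn.__name__}:{code_fp}:{input_fp}"
        if key not in _store:
            wrapper.misses += 1  # the function body really runs
            _store[key] = fn(*args)
        return _store[key]

    wrapper.misses = 0
    return wrapper


@memo
def embed(text):
    model = "model-v1"  # stands in for the embedding model in use
    return f"{model}:{len(text)}"


embed("doc")
embed("doc")  # same code, same input: cache hit
v1 = embed


@memo
def embed(text):
    model = "model-v2"  # "upgrading the model" changes the code hash
    return f"{model}:{len(text)}"


result_v2 = embed("doc")  # same input, new code: recomputed
```

The second definition of `embed` differs only in a constant, yet its cache key changes, so the unchanged document is recomputed once under the new model and cached again.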

Self-Hosting

CocoIndex is released under the Apache License 2.0, a permissive open-source license with no copyleft conditions. You can use it commercially, modify it, redistribute it, and embed it in proprietary products without any obligation to open-source your own code. The only requirements are attribution and preservation of the license notice.

Running CocoIndex yourself requires durable local storage for the state store: as described under Architecture, it uses embedded LMDB to persist memo hashes and track what has been indexed across runs, so those files must survive between invocations. Beyond that, you are responsible for provisioning the target stores you configure (Qdrant, Neo4j, LanceDB, etc.), managing scaling as data volumes grow, and ensuring uptime for any continuous-indexing daemon you run. The library itself is stateless between invocations; durability lives in the state store and your chosen vector or graph store. Teams operating at scale will need to think about resource allocation for the Rust execution core, connection pooling, and parallel execution concurrency tuning.

There is currently no official managed cloud offering or paid enterprise tier. All features (connectors, memoization, live update server, MCP integration) are available in the open-source package. The trade-off compared to a managed solution is that you own operations entirely: no SLAs, no vendor-managed upgrades, no built-in monitoring dashboards. Community support is available through Discord; enterprise-grade support channels do not yet exist. Teams that need production guarantees will need to build their own observability and reliability layers around the framework.

Repository Health

Pre-computed score based on development activity, maintenance, community, maturity, and trend momentum.

87/100 Excellent

Development Activity: 100
Maintenance: 100
Community: 68
Maturity: 40
Momentum: 40

Growing community support
Very active development
Well-maintained with consistent updates
Rapidly growing project

Technical Analysis

85/100 Excellent

Architecture: 90
Code Quality: 88
Innovation: 88
Learning Curve: 72

Repository Stats

Contributors

Total Commits: 1,967

Monthly Commits: 113

Watchers

Repo Age: 1.3 years

Last Commit: 1 day ago

Built With

Rust: 51.7%

Python: 48.0%

Recent Releases

100 total

~6.2 releases/month

Topics

agentic-data-framework, ai, ai-agents, change-data-capture, codebase-intelligence, context-engineering, data-engineering, data-indexing, data-processing, etl, indexing, knowledge-graph