ParadeDB
Born out of Y Combinator's S2023 batch, ParadeDB is a Postgres extension that delivers Elasticsearch-quality BM25 search and real-time analytics without a separate search cluster to manage.
ParadeDB is an open source Postgres extension that brings Elasticsearch-quality full-text search and analytics directly inside your Postgres database. Instead of running a separate Elasticsearch or OpenSearch cluster alongside Postgres and syncing data between the two with an ETL pipeline, teams create a BM25 index on their existing tables and query it with SQL, keeping search results transactionally consistent with the underlying data. The project came out of Y Combinator’s Summer 2023 batch and has since become a widely starred Postgres extension.
Under the hood, ParadeDB adds a custom Postgres index type — the BM25 index — built as an LSM tree of segments, where each segment pairs a Tantivy-backed inverted index for full-text search with a columnar store for fast analytical scans. New search operators like ||| (match disjunction), &&& (match conjunction), ### (phrase), and ## (proximity) trigger a custom scan node that pushes filters, joins, and aggregates directly into the index instead of applying them in a separate execution phase. Functions like pdb.score() and pdb.snippet() provide BM25 relevance ranking and result highlighting, while pdb.agg() accepts Elasticsearch-compatible JSON aggregation queries executed against the columnar index through an embedded, forked Apache DataFusion engine for OLAP-style processing.
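To make the idea of BM25 relevance ranking concrete, here is a minimal, self-contained sketch of the Okapi BM25 formula in plain Python. It is a toy under stated assumptions, not ParadeDB's or Tantivy's implementation: the whitespace tokenizer, the function name bm25_scores, the sample documents, and the k1 and b values (commonly used defaults) are all chosen for illustration only.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query with the Okapi BM25 formula.

    Toy illustration: whitespace tokenization, no stemming, no stop words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF variant, which stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample data for the sketch.
docs = ["sleek running shoes", "white jogging shoes", "plastic keyboard"]
scores = bm25_scores(docs, "running shoes")
ranked = [doc for _, doc in sorted(zip(scores, docs), reverse=True)]
```

In this sketch a document matching both query terms outranks one matching a single term, and a document matching none scores zero. That ordering is what a relevance function such as pdb.score() is meant to expose to ORDER BY.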
ParadeDB Community, the code in this repository, is licensed under AGPL-3.0 and supports the full BM25 search and analytics feature set on a single Postgres node with read replicas via logical replication. Official client integrations exist for Drizzle, Django, SQLAlchemy, Ruby on Rails, and Entity Framework Core, translating ParadeDB’s SQL operators into each framework’s native query builder syntax. A closed-source ParadeDB Enterprise tier adds high availability, physical read replica support for the BM25 index, and unlimited cluster size for teams that need it.
Development is unusually rigorous for an open source project of this kind: alongside standard unit and pg_regress tests, ParadeDB runs client-side integration tests with property-based query generation and an Antithesis deterministic-simulation-testing harness to catch rare concurrency bugs, plus a dedicated Stressgres tool for replaying production-like workloads under load.
What You Get
- A custom BM25 index type that updates in real time, inside the same transaction as the table write, instead of via asynchronous reindexing
- SQL-native search operators (|||, &&&, ###, ##) for match, phrase, and proximity queries that trigger ParadeDB’s custom scan
- BM25 relevance scoring (pdb.score()) and text highlighting (pdb.snippet()) for building ranked, annotated search results
- Elasticsearch-compatible JSON aggregations (pdb.agg()) executed against a columnar index via an embedded Apache DataFusion engine
- Automatic query parallelization for Top K (ORDER BY ... LIMIT) and aggregate queries across available CPU cores
- Official ORM and framework integrations for Drizzle, Django, SQLAlchemy, Ruby on Rails, and Entity Framework Core
- Beta join pushdown that executes INNER, SEMI, and ANTI joins involving search predicates directly inside the ParadeDB executor
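The four operator families named in this list (match disjunction, match conjunction, phrase, proximity) can be pictured with a small Python sketch over tokenized text. This only illustrates the query semantics; it does not show how ParadeDB evaluates them inside the index. The function names, the whitespace tokenizer, and the exact proximity rule (at most `distance` tokens between the two terms, in either order) are assumptions made for the example.

```python
def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def match_disjunction(doc, query):
    """True if the document contains at least one query token (OR)."""
    tokens = set(tokenize(doc))
    return any(term in tokens for term in tokenize(query))

def match_conjunction(doc, query):
    """True if the document contains every query token, in any order (AND)."""
    tokens = set(tokenize(doc))
    return all(term in tokens for term in tokenize(query))

def phrase(doc, query):
    """True if the query tokens appear consecutively and in order."""
    d, q = tokenize(doc), tokenize(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def proximity(doc, left, right, distance):
    """True if `left` and `right` occur with at most `distance` tokens between them.

    Assumed rule for this sketch: order-insensitive.
    """
    d = tokenize(doc)
    left_pos = [i for i, t in enumerate(d) if t == left]
    right_pos = [i for i, t in enumerate(d) if t == right]
    return any(
        abs(i - j) - 1 <= distance
        for i in left_pos
        for j in right_pos
        if i != j
    )

# Hypothetical sample document.
doc = "sleek white running shoes"
any_term = match_disjunction(doc, "running sandals")   # one token matches
all_terms = match_conjunction(doc, "running sandals")  # "sandals" is missing
exact = phrase(doc, "running shoes")
near = proximity(doc, "sleek", "shoes", 2)
```

The practical difference is recall versus precision: disjunction casts the widest net, conjunction and phrase narrow it, and proximity sits between the two.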
Common Use Cases
- Replacing a standalone Elasticsearch or OpenSearch deployment for full-text search on data that already lives in Postgres, eliminating ETL and dual-write consistency issues
- Building e-commerce or marketplace product search with BM25 relevance ranking, faceted filtering, and highlighted results
- Running Elasticsearch-style aggregations and dashboards directly against Postgres tables instead of maintaining a separate OLAP pipeline
- Powering AI agent and RAG applications that combine ParadeDB full-text search with pgvector similarity search for hybrid retrieval
- Keeping a normalized relational data model and using SQL JOINs for parent-child or nested search relationships, instead of denormalizing into Elasticsearch documents
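For the hybrid retrieval use case, one common way to combine a full-text ranking with a vector-similarity ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how such a merge could be done in application code; the source does not say which fusion method ParadeDB users should apply, and the function name, the document IDs, and the constant k=60 (a conventional choice) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical result lists: one from a BM25 query, one from a vector search.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
hybrid = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF only needs rank positions, not comparable scores, which is why it is often used when one list carries BM25 scores and the other carries vector distances.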
Under The Hood
Architecture
pg_search integrates with Postgres through pgrx’s extension points rather than patching Postgres internals: postgres::customscan registers planner hooks (register_rel_pathlist, register_upper_path for aggregates, register_join_pathlist, plus a subplan-join and window-aggregate hook) that intercept query planning only when a ParadeDB operator is present, otherwise falling through to native Postgres execution entirely, as spelled out in lib.rs’s _PG_init. The BM25 index itself (organized under index/, with directory, reader, and writer submodules) is structured as an LSM tree of segments, where each segment pairs a Tantivy-backed inverted index with a columnar fast-field store, giving write-optimized ingestion and read-optimized search in the same structure. Query translation lives in query/ (builder.rs, pdb_query.rs, range.rs, proximity/), turning SQL predicates and custom operators into Tantivy queries, while aggregate/ and postgres/customscan/aggregatescan hand off GROUP BY and aggregate plans to an embedded, forked Apache DataFusion for OLAP-style execution over those columnar segments. This is a layered, hook-based extension architecture — pgrx bridge, planner hooks, query/aggregate translation, Tantivy/DataFusion execution — rather than a monolith; changing the BM25 index’s on-disk format would ripple through directory/reader/writer, the customscan builders, and the pg_regress and integration suites that pin its behavior.
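The segment layout described above can be sketched as a toy data structure: each write produces an immutable segment holding both an inverted index and a column of values, segments are merged once there are too many, and a search consults every live segment. This is a simplified model under stated assumptions, not the pg_search on-disk format; the class names, the max_segments threshold, and the merge policy are hypothetical, and deletes and updates are ignored.

```python
class Segment:
    """Immutable segment: an inverted index plus a columnar store."""

    def __init__(self, rows):
        self.rows = list(rows)   # (doc_id, text, rating) tuples
        self.postings = {}       # term -> set of doc IDs (inverted index)
        self.column = {}         # doc_id -> rating (columnar values)
        for doc_id, text, rating in self.rows:
            self.column[doc_id] = rating
            for term in text.lower().split():
                self.postings.setdefault(term, set()).add(doc_id)


class ToyIndex:
    """LSM-style index: append small segments, merge when there are too many."""

    def __init__(self, max_segments=2):
        self.segments = []
        self.max_segments = max_segments

    def write(self, rows):
        # Writes never modify an existing segment; they add a new one.
        self.segments.append(Segment(rows))
        if len(self.segments) > self.max_segments:
            self._merge()

    def _merge(self):
        # Rebuild one larger segment from all rows (toy merge policy).
        merged = [row for seg in self.segments for row in seg.rows]
        self.segments = [Segment(merged)]

    def search(self, term):
        """Full-text lookup: union the postings of every segment."""
        hits = set()
        for seg in self.segments:
            hits |= seg.postings.get(term, set())
        return sorted(hits)

    def average_rating(self, term):
        """Analytical read: aggregate column values for matching documents."""
        ratings = [
            seg.column[doc_id]
            for seg in self.segments
            for doc_id in seg.postings.get(term, set())
        ]
        return sum(ratings) / len(ratings) if ratings else None


index = ToyIndex(max_segments=2)
index.write([(1, "running shoes", 4)])
index.write([(2, "hiking boots", 5)])
segments_before_merge = len(index.segments)
index.write([(3, "trail running shoes", 3)])  # third segment triggers a merge
```

Keeping the postings and the column in the same segment is what lets a single structure answer both "which rows match" and "what is the aggregate over the matches" without a second store.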
Tech Stack
pg_search is a Rust extension built on a pinned pgrx bridge and targeting a broad range of recent Postgres major versions via feature flags. Its search engine is a fork of Tantivy and its analytics engine a fork of Apache DataFusion, both pulled from ParadeDB-maintained forks at pinned revisions; supporting crates include the arrow-* family for columnar in-memory data, anyhow and thiserror for typed error handling, serde/serde_json/serde_cbor/bincode/postcard for index serialization, parking_lot for concurrency primitives, and a workspace-local tokenizers crate plus tantivy-jieba for CJK tokenization. Builds are managed with cargo-pgrx, with Nix available for reproducible development environments and Docker images for packaged deployment; the project publishes through Debian, RHEL/Ubuntu, PGXN, Homebrew/Postgres.app, and Docker Hub release pipelines. Deployment targets range from a plain self-hosted Postgres extension to a CloudNativePG-based Kubernetes Helm chart to one-click templates on cloud platform providers, with a separate closed-source Enterprise/BYOC tier layered on top for high availability and read replicas.
Code Quality
Testing is organized into distinct categories described in the project’s contributing guide: pg_regress golden-output tests for visually inspectable SQL results, integration tests that run as an external client against an installed extension (including property-based “client property tests” that generate randomized queries via a query generator), in-process unit and #[pg_test] tests inside pg_search/src that exercise the full pgrx/Postgres API, and Stressgres, a purpose-built stress-testing tool that replays representative workloads to catch concurrency and performance regressions; the suite is rounded out by an Antithesis deterministic-simulation-testing harness for surfacing rare concurrency bugs and dedicated upgrade-compatibility tests. In the modules reviewed, error handling favors typed, explicit propagation via anyhow::Result and thiserror over silent unwraps. Continuous integration is extensive, with dedicated linting workflows for Rust, Bash, Docker, YAML, and Markdown, plus separate test and publish pipelines per supported Postgres version and per OS/package format. This combination of golden, property-based, and deterministic-simulation testing is a level of rigor closer to database-grade infrastructure projects than typical application code.
What Makes It Unique
The core idea is architectural rather than a feature checklist: instead of running Elasticsearch alongside Postgres and syncing data through an ETL pipeline, ParadeDB implements Elasticsearch-class search (BM25 scoring, an Elasticsearch-compatible aggregation JSON syntax, phrase/proximity/fuzzy/regex queries) as a native Postgres index type, so index updates happen synchronously inside the same transaction as the underlying table write, giving full ACID consistency that a denormalized, eventually-consistent document store cannot offer. The BM25 index’s LSM-tree segment design, colocating an inverted index and a columnar store per segment, lets the same index serve both point full-text lookups and OLAP-style aggregations through an embedded DataFusion executor, avoiding a separate analytics pipeline. It isn’t the first project to bring search into Postgres (prior extensions have paired Postgres with external search engines), but the combination of a native custom-scan-driven index, an embedded OLAP engine, and deterministic-simulation testing rigor pushes this past a typical wrapper or thin client.
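As a rough picture of what an Elasticsearch-compatible aggregation over a columnar store involves, the sketch below runs a single `terms` aggregation request against an in-memory column. It is a simplification under stated assumptions: the request body follows the Elasticsearch `terms` aggregation shape, but the function name run_terms_agg, the column layout, and the trimmed response format are illustrative and do not describe what pdb.agg() accepts or returns.

```python
import json
from collections import Counter

def run_terms_agg(columns, agg_json, doc_ids=None):
    """Execute one Elasticsearch-style `terms` aggregation over a column.

    `columns` maps a field name to a list of values (one per document).
    `doc_ids` optionally restricts the aggregation to search matches.
    """
    request = json.loads(agg_json)
    field = request["terms"]["field"]
    values = columns[field]
    if doc_ids is not None:
        # Only aggregate over documents that matched the search predicate.
        values = [values[i] for i in doc_ids]
    counts = Counter(values)
    # Buckets ordered by descending count, then key for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"buckets": [{"key": k, "doc_count": c} for k, c in ordered]}

# Hypothetical columnar data: position i holds the value for document i.
columns = {
    "category": ["Footwear", "Footwear", "Electronics", "Footwear"],
    "rating": [4, 5, 3, 4],
}
facets = run_terms_agg(columns, '{"terms": {"field": "category"}}')
filtered = run_terms_agg(columns, '{"terms": {"field": "category"}}', doc_ids=[0, 2])
```

Because the aggregation reads one contiguous column instead of whole rows, the same pattern scales to faceted counts over the documents returned by a search predicate.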
Self-Hosting
Licensing Model
ParadeDB Community (the code in this repository) is licensed under AGPL-3.0: free to self-host, modify, and distribute, provided distributed derivative works are released under the same copyleft license.
Self-Hosting Restrictions
- High availability support is not available in Community; it requires ParadeDB Enterprise
- Physical read replica support for the BM25 index is not available in Community, since Community does not support physical replication for the index (BM25 indexes exist only on the primary); read replicas in Community rely on logical replication instead
- Maximum cluster size in Community is limited to a single node; Enterprise supports unlimited cluster size
Enterprise Features
- Waives the AGPL-3.0 copyleft provision, permitting use in proprietary derivative works
- Adds closed-source features: high availability, physical read replica support for BM25 indexes, and unlimited cluster size
- ParadeDB BYOC (Bring Your Own Cloud) offers a managed deployment of ParadeDB Enterprise inside the customer’s own AWS or GCP account
Cloud vs Self-Hosted
ParadeDB does not currently offer a fully managed cloud database; a managed cloud offering is listed as long-term roadmap work. Today, ParadeDB Community is deployed self-hosted, via one-click templates on cloud platform providers, or as ParadeDB BYOC (Enterprise only) inside the customer’s own cloud account.
License Key Required
No. ParadeDB Community is fully usable with no license key. A commercial agreement is only needed to obtain ParadeDB Enterprise (waived copyleft plus high availability, read replicas, and unlimited cluster size).