CrateDB

Distributed SQL database for real-time analytics at scale

595 forks

CrateDB is a distributed SQL database designed for real-time analytics on massive datasets, combining the familiarity of standard SQL with the scalability of NoSQL systems. It targets developers and data engineers working with high-velocity data from IoT devices, industrial sensors, streaming applications, and AI workflows who need low-latency queries and horizontal scalability without complex infrastructure management. Built in Java and leveraging Lucene for full-text search and indexing, CrateDB supports containerized deployments via Docker and Kubernetes, and integrates with cloud platforms like AWS and Azure.

CrateDB operates on a shared-nothing architecture with auto-partitioning, auto-rebalancing, and self-healing clusters. It exposes SQL via the PostgreSQL wire protocol and HTTP API, supports dynamic schemas for JSON and nested objects, and includes built-in support for time-series, geospatial, text, and vector data types. Its distributed query engine parallelizes operations across nodes, enabling fast analytics on live data without batch delays.
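As a rough illustration of the HTTP interface mentioned above, the following Python sketch builds a JSON request for the SQL endpoint using only the standard library. It assumes a single local node reachable at http://localhost:4200 and the `_sql` path with a `stmt`/`args` JSON body; the helper names `build_sql_request` and `run_sql` are made up for this example, and authentication and error handling are left out.

```python
import json
import urllib.request

# Assumption: a CrateDB node on the default HTTP port of a local install.
CRATE_SQL_URL = "http://localhost:4200/_sql"


def build_sql_request(stmt, args=None, url=CRATE_SQL_URL):
    """Build (but do not send) a POST request for the HTTP SQL endpoint."""
    payload = {"stmt": stmt}
    if args is not None:
        payload["args"] = list(args)  # positional values for ? placeholders
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(stmt, args=None, url=CRATE_SQL_URL):
    """Send the statement and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(stmt, args, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs a reachable node, so it is left commented out):
# print(run_sql("SELECT name FROM sys.cluster"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything touches the network. The same statements could instead be sent through any PostgreSQL driver over the wire protocol.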

What You Get

  • Standard SQL with PostgreSQL wire protocol - Execute standard SQL queries over the PostgreSQL wire protocol, so many existing PostgreSQL clients, tools, and ORMs can connect with little or no modification.
  • Dynamic table schemas and queryable JSON objects - Store and query semi-structured JSON data with flexible schemas, allowing new fields to be added on-the-fly without schema migrations.
  • Time-series data support - Ingest and analyze high-volume time-series data with built-in functions for interval normalization, trend analysis, and real-time monitoring.
  • Geospatial data types and queries - Store and query location data using the GEO_POINT and GEO_SHAPE types (shapes such as points, line strings, and polygons) with functions for distance calculations and spatial filtering.
  • Vector embedding storage and similarity search - Store vector embeddings and perform approximate nearest neighbor (ANN) searches for AI-driven recommendation systems and semantic retrieval.
  • Auto-sharding, auto-partitioning, and auto-replication - Automatically distribute data across nodes, partition tables by time or key, and replicate shards for fault tolerance without manual intervention.
  • Real-time full-text search - Perform fast, distributed full-text search on text fields using Lucene-based indexing with support for stemming, synonyms, and phrase matching.
  • High-throughput streaming ingestion - Ingest tens of thousands of records per second in real time from IoT devices, logs, or event streams without performance degradation.
  • Admin UI with SQL console - Use the built-in web-based Admin UI to run SQL queries, visualize results, and monitor cluster health through an interactive SQL interface.
  • Multi-model data support - Store and query relational, document, time-series, geospatial, text, and vector data in a single database without requiring separate systems.
  • Self-healing and auto-rebalancing clusters - Automatically detect node failures, redistribute data, and maintain high availability without operator intervention.
  • HTTP API and native clients - Interact with CrateDB via RESTful HTTP endpoints or use official clients for Python, Java, Node.js, and other languages.
  • Deployment flexibility across cloud, edge, and on-prem - Run CrateDB identically on Kubernetes, AWS, Azure, private data centers, or edge devices with consistent behavior and tooling.
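To make the "interval normalization" item in the list above more concrete, here is a small pure-Python sketch of fixed-width time bucketing, which is roughly what a SQL bucketing function such as `DATE_BIN` or `DATE_TRUNC` computes on the database side. It runs without a database; the function names and the sample readings in the usage note are invented for illustration and are not part of CrateDB.

```python
from datetime import datetime, timedelta, timezone

# Assumption: buckets are aligned to the Unix epoch unless another origin is given.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_timestamp(ts, interval, origin=EPOCH):
    """Return the start of the fixed-width bucket that contains ts."""
    steps = (ts - origin) // interval  # floor division of two timedeltas
    return origin + steps * interval


def average_per_bucket(readings, interval):
    """Group (timestamp, value) pairs by bucket and average each group."""
    sums = {}
    for ts, value in readings:
        bucket = bin_timestamp(ts, interval)
        total, count = sums.get(bucket, (0.0, 0))
        sums[bucket] = (total + value, count + 1)
    return {b: total / count for b, (total, count) in sorted(sums.items())}
```

For example, readings taken at 12:00:30 and 12:03:10 fall into the 12:00 bucket when `interval` is `timedelta(minutes=5)`, while a reading at 12:07:00 falls into the 12:05 bucket. In a real deployment this aggregation would normally be pushed down into a SQL `GROUP BY` so that it runs in parallel across nodes.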

Common Use Cases

  • Industrial IoT monitoring - A manufacturer uses CrateDB to ingest sensor data from thousands of machines in real time, running SQL queries to detect anomalies and predict maintenance needs.
  • Real-time analytics for connected devices - An IoT platform uses CrateDB to store and analyze telemetry from smart devices, enabling dashboards that show live usage patterns and device health.
  • AI-powered recommendation engine - A retail company stores product embeddings in CrateDB and performs vector similarity searches to recommend items based on user behavior and product attributes.
  • Time-series monitoring for DevOps - A SaaS team uses CrateDB to collect and query application metrics and logs with millisecond latency, enabling real-time alerting and performance dashboards.
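The recommendation use case in the list above comes down to a nearest-neighbor lookup over embeddings. The sketch below shows that idea as an exact, brute-force search in plain Python using Euclidean distance. It is a stand-in to explain the concept, not CrateDB's indexed search; the product names, the three-dimensional vectors, and the function names are all invented for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_products(query, catalog, k=2):
    """Return the k product ids whose embeddings are closest to query."""
    ranked = sorted(catalog, key=lambda pid: euclidean(query, catalog[pid]))
    return ranked[:k]


# Hypothetical catalog: product id -> embedding (real embeddings have far more dimensions).
catalog = {
    "trail-shoe": [0.9, 0.1, 0.0],
    "road-shoe": [0.8, 0.2, 0.1],
    "rain-jacket": [0.1, 0.9, 0.3],
}
```

A brute-force scan like this compares the query against every stored vector, which is why databases rely on approximate indexes once the catalog grows large.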

Under The Hood

Architecture

  • The codebase demonstrates a layered structure, with a clear distinction between core functionality, server components, and a plugin architecture for extensibility.
  • A significant portion of the code adapts and builds upon existing Elasticsearch components, indicating a strategy of leveraging established functionality.
  • While a good degree of separation of concerns is present, some coupling exists between modules, potentially impacting maintainability.
  • The build process is well-defined, utilizing Maven and a complex assembly plugin for packaging.

Tech Stack

  • The project is primarily Java-based, with Python playing a crucial role in build processes, testing, and documentation.
  • Modern dependency management tools like uv and asdf are employed for Python environments.
  • Extensive configuration files manage code style, quality, and CI/CD integration, indicating a focus on best practices.
  • Integration with tools like Codecov, Minio, and ReadTheDocs demonstrates a commitment to quality, scalability, and documentation.

Code Quality

  • A robust testing strategy is evident, with a comprehensive suite of tests throughout the repository.
  • Naming conventions are generally consistent and readable, adhering to Java standards.
  • Type safety is well-maintained, leveraging Java’s static typing.
  • Error handling relies heavily on standard try-catch blocks, with limited use of custom exception types.

What Makes It Unique

  • CrateDB provides a SQL interface on top of Lucene/Elasticsearch, bridging the gap between traditional databases and search engine capabilities.
  • The plugin architecture allows for seamless integration with various storage systems, expanding data storage options beyond the default configuration.
  • The project demonstrates a clever adaptation of Elasticsearch components to provide a unique data processing and querying experience.
