Volga
A Rust-based real-time data processing engine for AI/ML feature computation, built on Apache DataFusion and Arrow — positioned as an alternative to Flink, Spark, Chronon, and OpenMLDB with unified streaming, batch, and request-time execution.
Volga targets a specific pain point in modern AI/ML systems (recommendation engines, fraud detection, personalization, search, RAG): features must be computed consistently across streaming, batch, and request-time contexts. Meeting that requirement typically means stitching together separate systems, such as Flink for streaming and Spark for batch, plus AI/ML-specific tools like Airbnb’s Chronon or OpenMLDB for feature serving.
Volga instead aims to unify these execution modes in one engine, built in Rust on top of Apache DataFusion (query execution) and Apache Arrow (columnar data format). It specializes in continuous window aggregations, a common and often awkward-to-implement pattern in feature engineering pipelines.
Volga is Apache-2.0 licensed and still early-stage, judging by its GitHub activity metrics. Its design is documented in depth on the project’s Substack blog, which presents Volga as a considered rethink of existing streaming and feature-engineering infrastructure rather than an incremental tweak.
What You Get
- A single engine for streaming, batch, and request-time feature computation instead of stitching together separate systems
- Specialized support for continuous window aggregations, a common but awkward feature-engineering pattern
- Built on Apache DataFusion for query execution and Apache Arrow for columnar data representation
- SQL as a query interface for defining feature computation logic
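To make the idea of a SQL-defined window aggregation concrete, here is a minimal sketch of a per-user "spend over the trailing 60 seconds" feature written as a standard SQL window function. Volga's actual SQL dialect and window syntax are not described on this page, so the query runs against SQLite purely as a stand-in; the table `tx`, its columns, and the function `windowed_spend` are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical transaction events: (user_id, event time in seconds, amount).
EVENTS = [
    ("u1", 0, 10.0),
    ("u1", 30, 5.0),
    ("u1", 90, 20.0),
    ("u2", 10, 7.0),
    ("u1", 100, 1.0),
]


def windowed_spend(rows, window_seconds=60):
    """Return (user_id, ts, spend in the trailing window) for every event.

    SQLite is only a stand-in engine here; this is not Volga's API.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tx (user_id TEXT, ts INTEGER, amount REAL)")
    conn.executemany("INSERT INTO tx VALUES (?, ?, ?)", rows)
    # RANGE frames are defined on the ORDER BY value (event time), so the
    # window covers every event with ts in [current ts - window, current ts].
    query = f"""
        SELECT user_id,
               ts,
               SUM(amount) OVER (
                   PARTITION BY user_id
                   ORDER BY ts
                   RANGE BETWEEN {int(window_seconds)} PRECEDING AND CURRENT ROW
               ) AS spend_last_window
        FROM tx
        ORDER BY user_id, ts
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in windowed_spend(EVENTS):
        print(row)
```

The window here is defined on event time rather than on a row count, which is the usual shape for features such as "transactions in the last N minutes".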
Common Use Cases
- Computing ML features consistently across streaming, batch, and request-time contexts for recommendation or fraud-detection systems
- Replacing a Flink+Spark+feature-store stack with one unified engine for AI/ML data pipelines
- Running continuous window aggregations for real-time personalization or search ranking features
- Building RAG or search systems that need features computed consistently between offline training and online serving
Under The Hood
Architecture: Volga uses Apache DataFusion as its query execution engine and Apache Arrow for in-memory columnar data representation rather than implementing a custom execution engine from scratch, building on DataFusion’s SQL query planning and Arrow’s efficient columnar operations. The unified streaming/batch/request-time execution model is the core architectural bet: rather than three separate systems each computing features differently, Volga aims for a single set of execution semantics applied consistently across all three contexts.
Tech Stack: Rust for the core engine, Apache DataFusion for SQL query execution, and Apache Arrow for columnar data handling. These are the same foundational technologies used by several modern data engines, applied here to the AI/ML feature-computation use case.
Code Quality: The project documents its design rationale extensively on a dedicated Substack blog rather than relying solely on README claims, which gives more context for evaluating its architectural choices. GitHub activity metrics show the project is still early-stage, with the somewhat inconsistent maintenance cadence typical of a young infrastructure project.
What Makes It Unique: Most teams solve the streaming/batch/request-time feature-consistency problem by combining multiple specialized systems (Flink, Spark, Chronon, OpenMLDB). Volga’s bet is that a single Rust engine built on DataFusion and Arrow can unify all three execution modes with consistent semantics, avoiding the operational and consistency overhead of running several systems for the same underlying feature-computation problem.
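The consistency requirement described above can be illustrated with a small, self-contained sketch: one feature definition (sum of amounts over a trailing 60-second window) evaluated in batch over history, incrementally over a stream, and on demand at request time, with all three expected to agree. This is plain Python and not Volga's API; `feature_at`, `batch_features`, `StreamingSum`, and the sample events are hypothetical, and the sketch assumes events arrive in strictly increasing timestamp order.

```python
from collections import deque

WINDOW = 60  # hypothetical window length in seconds

# Hypothetical events for one user: (event time in seconds, amount in cents).
EVENTS = [(0, 10), (30, 5), (90, 20), (100, 1)]


def feature_at(events, ts, window=WINDOW):
    """Reference definition: sum of amounts with event time in [ts - window, ts]."""
    return sum(amount for t, amount in events if ts - window <= t <= ts)


def batch_features(events, window=WINDOW):
    """Batch mode: recompute the feature at every historical event time."""
    return [feature_at(events, t, window) for t, _ in events]


class StreamingSum:
    """Streaming mode: keep a running sum and evict events that fall out of the window."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.buffer = deque()  # events still inside the window, oldest first
        self.total = 0

    def update(self, ts, amount):
        """Ingest one event and return the feature value as of that event."""
        self.buffer.append((ts, amount))
        self.total += amount
        while self.buffer and self.buffer[0][0] < ts - self.window:
            _, old_amount = self.buffer.popleft()
            self.total -= old_amount
        return self.total

    def query(self, now):
        """Request-time mode: read the feature at `now` without ingesting an event."""
        return sum(a for t, a in self.buffer if now - self.window <= t <= now)


if __name__ == "__main__":
    stream = StreamingSum()
    streamed = [stream.update(ts, amount) for ts, amount in EVENTS]
    print("batch:    ", batch_features(EVENTS))
    print("streaming:", streamed)
    print("request:  ", stream.query(155), "vs reference", feature_at(EVENTS, 155))
```

In a multi-system stack, each of these three code paths would typically live in a different engine, which is where training/serving skew tends to creep in; a unified engine aims to make agreement hold by construction rather than by testing.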
Self-Hosting
Licensing Model: Apache-2.0 licensed; fully open source with no license key.
Self-Hosting Restrictions: Not applicable; it’s a self-hosted data processing engine you deploy within your own infrastructure.
License Key Required: No.
Related Apps
ClickHouse
Databases · Analytics · Data Engineering
Open-source column-oriented database that delivers real-time analytical queries on petabyte-scale data with millisecond latency.
Apache 2.0
Apache Airflow
Data Engineering
Define, schedule, and monitor complex data workflows as Python code — with a powerful UI, 80+ provider integrations, and battle-tested scalability across thousands of production deployments.
Apache 2.0
Label Studio
AI Development · Data Engineering
Label Studio is an open-source, multi-type data labeling platform that lets teams annotate images, text, audio, video, and time series data with a configurable XML-based UI and export annotations in formats ready for any ML framework.