Visual ETL that compiles to Polars — build pipelines on a canvas, export as standalone Python, and run anywhere without platform lock-in.
Flowfile is an open-source visual ETL tool that bridges the gap between low-code pipeline design and production-grade Python code. You build data pipelines by dragging nodes onto a canvas, connecting them visually, and watching live data previews update as you work. When you’re done, Flowfile exports the entire flow as standalone Polars Python code that runs anywhere with no Flowfile dependency required.
At its core, Flowfile is a monorepo of interconnected services: a FastAPI backend that executes ETL logic using the Polars dataframe library, a separate compute worker that handles heavy data processing tasks, and a Vue 3 + VueFlow frontend that renders the visual canvas. The same pipeline can be built either visually or programmatically through Flowfile’s Python API, which mirrors Polars’ familiar chained method syntax. Switching between the two is seamless — call open_graph_in_editor() on any Python-defined pipeline and it opens immediately in the visual designer.
Beyond the canvas, Flowfile ships an integrated data catalog backed by Delta Lake for time-travel and versioning, a SQL editor with embedded visualization powered by Graphic Walker, a built-in scheduler supporting interval and event-triggered runs, sandboxed Python kernel containers for running custom user code in isolation, and support for reading from and writing to PostgreSQL, MySQL, SQL Server, Oracle, S3, Azure Blob Storage, Google Cloud Storage, and Kafka. The browser-based demo runs entirely via Pyodide — no installation needed to try it.
Flowfile is MIT-licensed and available as a pip package, as a desktop app for Windows, macOS, and Linux built with Tauri, or as a Docker Compose stack. An npm package (flowfile-editor) lets web developers embed the visual ETL canvas as a standalone Vue component in their own applications.
Architecture
Flowfile is organized as a layered monorepo where three runtime services (the Core FastAPI backend, the Worker FastAPI service, and the Vue 3 frontend) communicate over HTTP and WebSocket. The Core service owns the FlowGraph abstraction: a directed acyclic graph of FlowNode instances where each node encapsulates its transformation function, caching hash, connection references, and execution state. A topological execution planner computes parallelizable stages using Kahn’s algorithm, grouping nodes with no mutual dependencies into the same stage so they can run concurrently. Nodes offload expensive data operations to the Worker service via subprocess and loky-based futures, keeping the Core responsive for UI interactions. The catalog, scheduler, kernel manager, and AI assistant are each structured as independent FastAPI routers mounted onto the Core, keeping concerns separated and individually testable.
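The stage grouping described above can be sketched with Kahn's algorithm in plain Python. This is a minimal illustration and not Flowfile's actual planner: the function name `plan_stages`, the node names, and the edge list are all hypothetical, and the real FlowGraph and FlowNode classes are not used.

```python
from collections import defaultdict


def plan_stages(nodes, edges):
    """Group the nodes of a DAG into stages using Kahn's algorithm.

    Every node in a stage depends only on nodes from earlier stages,
    so the nodes inside one stage could run concurrently.
    """
    indegree = {node: 0 for node in nodes}
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1

    # Start with all nodes that have no upstream dependency.
    ready = sorted(node for node, degree in indegree.items() if degree == 0)
    stages = []
    scheduled = 0
    while ready:
        stages.append(ready)
        scheduled += len(ready)
        next_ready = []
        for node in ready:
            for child in downstream[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)  # sorted only to make the output deterministic

    if scheduled != len(indegree):
        raise ValueError("graph contains a cycle; no valid execution order exists")
    return stages


# Hypothetical five-node flow: two independent readers feeding a join.
nodes = ["read_orders", "read_customers", "filter_orders", "join", "write_output"]
edges = [
    ("read_orders", "filter_orders"),
    ("filter_orders", "join"),
    ("read_customers", "join"),
    ("join", "write_output"),
]
stages = plan_stages(nodes, edges)
for index, stage in enumerate(stages, start=1):
    print(f"stage {index}: {stage}")
```

In this example the two reader nodes land in the same first stage because neither depends on the other, while the join has to wait until both of its inputs have been scheduled.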
Tech Stack
The backend is Python 3.10–3.13 with FastAPI and Uvicorn, using Polars as the primary dataframe engine throughout the transformation pipeline. Delta Lake (via the deltalake Python binding) backs the data catalog, enabling versioned table writes and time-travel queries. External database connectivity uses ConnectorX for high-throughput reads from PostgreSQL, MySQL, SQL Server, and Oracle. Cloud storage is handled by s3fs, adlfs (Azure), and gcsfs (GCS). The Kafka integration uses confluent-kafka. The worker offloads heavy computation using loky’s process pools. The frontend is built with Vue 3, VueFlow for the node-graph canvas, and Graphic Walker for embedded visualization. The desktop build uses Tauri 2 (Rust shell wrapping the web frontend). The browser-only WASM variant runs Python via Pyodide. Alembic manages the SQLite/PostgreSQL metadata schema, and Docker SDK is used programmatically to create and manage kernel sandbox containers.
Code Quality
The codebase uses Pydantic v2 models extensively for schema validation across API boundaries, giving strong type guarantees at the HTTP layer. Type hints are applied consistently through the core business logic. Ruff handles both linting and formatting with a 120-character line length, configured with pyupgrade, flake8-bugbear, and isort rules. The test suite uses pytest with pytest-asyncio and testcontainers for database and cloud storage integration tests, with dedicated Docker-based CI workflows for kernel E2E tests and Docker authentication tests. The CI runs CodeQL for security analysis. The README explicitly acknowledges that comprehensive test coverage is a pending TODO item: integration tests exist, but unit test depth is uneven. The Worker and Core execution paths are covered, but coverage across all 40+ node types is limited.
What Makes It Unique
The genuinely novel contribution is the bidirectional compilation between visual canvas and executable Python code: not just one-way export, but true bidirectionality, where Python API calls build the same FlowGraph object the visual designer operates on. Virtual flow tables extend this further: a catalog table backed by a Polars LazyFrame can cross flow boundaries with filter and projection pushdown intact, meaning upstream optimization continues across independently designed pipelines. The WASM mode, which runs a 20+ node ETL engine entirely in the browser via Pyodide with no server-side component, is uncommon at this feature depth. The combination of Delta Lake versioning, cross-flow lazy optimization, browser-native execution, and bidirectional code/canvas synchronization in a single MIT-licensed tool has no direct open-source equivalent.
Flowfile is released under the MIT License, one of the most permissive open-source licenses available. You can use it commercially, modify it, distribute it, incorporate it into proprietary products, and self-host it without any royalty or attribution obligation beyond preserving the license notice. There are no open-core restrictions, no feature gates, and no license key checks anywhere in the codebase — every capability described in the README is available to anyone who clones the repository.
Running Flowfile yourself means operating three interconnected services: the Core FastAPI server (port 63578), the Worker FastAPI service (port 63579), and the Vue frontend. The Docker Compose setup bundles all three and is the recommended self-hosted deployment path. Because Flowfile uses Docker to spawn sandboxed Python kernel containers for user code execution, the host must have access to the Docker socket — a meaningful operational consideration in production environments where socket exposure is a security concern. The project documentation suggests using a Docker socket proxy to limit the API surface. Persistent storage for the Delta catalog, saved flows, and internal state is backed by local volumes, so you are responsible for backup, replication, and disaster recovery. Alembic handles database migrations automatically on startup, but managing the SQLite/Postgres metadata database and its schema evolution over time is your responsibility.
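After bringing the stack up, a quick way to confirm that the Core and Worker services are listening is a plain TCP check. The sketch below assumes both services are reachable on localhost at the ports named above; the function names are hypothetical, and the check only verifies that a port accepts connections, not that the application behind it is healthy.

```python
import socket

# Ports taken from the text above; the host is an assumption.
SERVICES = {"core": 63578, "worker": 63579}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        status = "reachable" if reachable else "not reachable"
        print(f"{name} (port {SERVICES[name]}): {status}")
```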
There is currently no hosted cloud offering, managed service tier, or enterprise support contract for Flowfile — it is a single-developer open-source project with active community contributions. This means you get the full feature set with no artificial limits, but also no SLA, no guaranteed upgrade path, and no official support channel beyond GitHub Issues and Discussions. The feature roadmap explicitly notes that multi-user collaboration and role-based access control are not yet implemented, which limits suitability for teams sharing a single instance in production without additional access controls applied at the infrastructure level.
Databases · Analytics · Data Engineering
Open-source column-oriented database that delivers real-time analytical queries on petabyte-scale data with millisecond latency.
Devops · Data Engineering · Automation
Event-driven orchestration platform for data, AI, and infrastructure workflows — define everything in YAML, run anywhere at scale.
Analytics · Data Engineering · AI Assistants
Open-source BI tool with drag-and-drop dashboards, 20+ data source connectors, and AI-powered natural language queries — a self-hosted alternative to Tableau.