Flowfile

Name: Flowfile
Rating: 5 (321 reviews)

Visual ETL that compiles to Polars — build pipelines on a canvas, export as standalone Python, and run anywhere without platform lock-in.

321 stars

25 forks

MIT License

Python

Flowfile is an open-source visual ETL tool that bridges the gap between low-code pipeline design and production-grade Python code. You build data pipelines by dragging nodes onto a canvas, connecting them visually, and watching live data previews update as you work. When you’re done, Flowfile exports the entire flow as standalone Polars Python code that runs anywhere without a Flowfile dependency.

At its core, Flowfile is a monorepo of interconnected services: a FastAPI backend that executes ETL logic using the Polars dataframe library, a separate compute worker that handles heavy data processing tasks, and a Vue 3 + VueFlow frontend that renders the visual canvas. The same pipeline can be built either visually or programmatically through Flowfile’s Python API, which mirrors Polars’ familiar chained method syntax. Switching between the two is seamless — call open_graph_in_editor() on any Python-defined pipeline and it opens immediately in the visual designer.

Beyond the canvas, Flowfile ships an integrated data catalog backed by Delta Lake for time-travel and versioning, a SQL editor with embedded visualization powered by Graphic Walker, a built-in scheduler supporting interval and event-triggered runs, sandboxed Python kernel containers for running custom user code in isolation, and support for reading from and writing to PostgreSQL, MySQL, SQL Server, Oracle, S3, Azure Blob Storage, Google Cloud Storage, and Kafka. The browser-based demo runs entirely via Pyodide — no installation needed to try it.

Flowfile is MIT-licensed and available as a pip package, a desktop app for Windows, macOS, and Linux built with Tauri, or as a Docker Compose stack. An npm package (flowfile-editor) lets web developers embed the visual ETL canvas as a standalone Vue component in their own applications.

What You Get

A visual canvas with 40+ node types covering joins, fuzzy matching, filters, pivots, aggregations, text-to-rows, formula editing, and a Polars code escape-hatch
Bidirectional Python API that mirrors Polars syntax — build a flow in code and open it in the visual designer, or export any canvas flow as standalone Polars Python
An integrated Delta Lake data catalog with Unity-style hierarchy, version history, time travel, and virtual flow tables with lazy pushdown optimization
A SQL editor backed by Polars SQLContext that queries any registered catalog table, with results feeding directly into Graphic Walker for visual exploration
A built-in scheduler that runs flows on intervals or triggers them when catalog tables update, with run history, logs, and cancellation in the UI
Sandboxed Python kernel containers (Docker-based) for executing custom user code with their own package environments and Jupyter-style notebooks
Connectors for PostgreSQL, MySQL, SQL Server, Oracle, S3, Azure Blob, Google Cloud Storage, and Kafka/Redpanda — plus local files and databases
A WASM browser demo at demo.flowfile.org and an embeddable Vue component (flowfile-editor) for dropping the canvas into any web app

Common Use Cases

Data analysts building ETL pipelines visually without writing boilerplate Polars code, then exporting the result as production-ready Python scripts
Data engineers integrating Kafka streams, cloud storage buckets, and multiple databases into a unified pipeline with a scheduler handling incremental loads
Python developers using Flowfile’s programmatic API to define transformation logic in code, then sharing the visual representation with non-technical stakeholders
Teams building lightweight data platforms without a cloud data warehouse — using the Delta catalog for versioned storage, the SQL editor for exploration, and scheduled flows for refresh
Developers embedding a Polars-powered visual ETL canvas into their own SaaS application using the standalone flowfile-editor npm component
Data scientists running exploratory transformations in the visual designer and dispatching custom ML code to isolated kernel containers to avoid polluting the host environment

Under The Hood

Architecture

Flowfile is organized as a layered monorepo where three runtime services (the Core FastAPI backend, the Worker FastAPI service, and the Vue 3 frontend) communicate over HTTP and WebSocket. The Core service owns the FlowGraph abstraction: a directed acyclic graph of FlowNode instances where each node encapsulates its transformation function, caching hash, connection references, and execution state. A topological execution planner computes parallelizable stages using Kahn’s algorithm, grouping nodes with no mutual dependencies into the same stage so they can run concurrently. Nodes offload expensive data operations to the Worker service via subprocess and loky-based futures, keeping the Core responsive for UI interactions. The catalog, scheduler, kernel manager, and AI assistant are each structured as independent FastAPI routers mounted onto the Core, keeping concerns separated and individually testable.
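The stage grouping described above can be illustrated with a small, standard-library-only sketch of Kahn’s algorithm. This is not Flowfile’s actual planner: the function name plan_stages, the edge-list input format, and the node names are all assumptions made for illustration.

```python
from collections import defaultdict


def plan_stages(nodes, edges):
    """Group DAG nodes into stages using Kahn's algorithm.

    Nodes within one stage have no dependencies on each other,
    so a runner could execute them concurrently.
    `edges` is a list of (upstream, downstream) pairs.
    """
    indegree = {node: 0 for node in nodes}
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1

    # Stage 0: every node with no upstream dependency.
    ready = sorted(node for node, degree in indegree.items() if degree == 0)
    stages = []
    scheduled = 0
    while ready:
        stages.append(ready)
        scheduled += len(ready)
        next_ready = []
        for node in ready:
            for child in downstream[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all inputs are now scheduled
                    next_ready.append(child)
        ready = sorted(next_ready)

    if scheduled != len(nodes):
        raise ValueError("graph contains a cycle")
    return stages


# Hypothetical flow: two independent reads feeding a join.
nodes = ["read_orders", "read_customers", "filter_orders", "join", "write"]
edges = [
    ("read_orders", "filter_orders"),
    ("filter_orders", "join"),
    ("read_customers", "join"),
    ("join", "write"),
]
stages = plan_stages(nodes, edges)
```

In this example the two read nodes land in the first stage because neither depends on the other, while the join has to wait until the filter stage has been scheduled.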

Tech Stack

The backend is Python 3.10–3.13 with FastAPI and Uvicorn, using Polars as the primary dataframe engine throughout the transformation pipeline. Delta Lake (via the deltalake Python binding) backs the data catalog, enabling versioned table writes and time-travel queries. External database connectivity uses ConnectorX for high-throughput reads from PostgreSQL, MySQL, SQL Server, and Oracle. Cloud storage is handled by s3fs, adlfs (Azure), and gcsfs (GCS). The Kafka integration uses confluent-kafka. The worker offloads heavy computation using loky’s process pools. The frontend is built with Vue 3, VueFlow for the node-graph canvas, and Graphic Walker for embedded visualization. The desktop build uses Tauri 2 (Rust shell wrapping the web frontend). The browser-only WASM variant runs Python via Pyodide. Alembic manages the SQLite/PostgreSQL metadata schema, and Docker SDK is used programmatically to create and manage kernel sandbox containers.

Code Quality

The codebase uses Pydantic v2 models extensively for schema validation across API boundaries, giving strong type guarantees at the HTTP layer. Type hints are applied consistently through the core business logic. Ruff handles both linting and formatting with a 120-character line length, configured with pyupgrade, flake8-bugbear, and isort rules. The test suite uses pytest with pytest-asyncio and testcontainers for database and cloud storage integration tests, with dedicated Docker-based CI workflows for kernel E2E tests and Docker authentication tests. The CI runs CodeQL for security analysis. The README explicitly lists comprehensive test coverage as a pending TODO item: integration tests exist, but unit test depth is uneven. The Worker and Core execution paths are covered, while coverage across all 40+ node types is limited.

What Makes It Unique

The distinguishing contribution is bidirectional translation between the visual canvas and executable Python code. This goes beyond one-way export: Python API calls build the same FlowGraph object the visual designer operates on. Virtual flow tables extend this further: a catalog table backed by a Polars LazyFrame can cross flow boundaries with filter and projection pushdown intact, meaning upstream optimization continues across independently designed pipelines. The WASM mode running a 20+ node ETL engine entirely in the browser via Pyodide, with no server-side component, is uncommon at this feature depth. The combination of Delta Lake versioning, cross-flow lazy optimization, browser-native execution, and bidirectional code/canvas synchronization in a single MIT-licensed tool has no direct open-source equivalent.

Self-Hosting

Flowfile is released under the MIT License, one of the most permissive open-source licenses available. You can use it commercially, modify it, distribute it, incorporate it into proprietary products, and self-host it without any royalty or attribution obligation beyond preserving the license notice. There are no open-core restrictions, no feature gates, and no license key checks anywhere in the codebase — every capability described in the README is available to anyone who clones the repository.

Running Flowfile yourself means operating three interconnected services: the Core FastAPI server (port 63578), the Worker FastAPI service (port 63579), and the Vue frontend. The Docker Compose setup bundles all three and is the recommended self-hosted deployment path. Because Flowfile uses Docker to spawn sandboxed Python kernel containers for user code execution, the host must have access to the Docker socket — a meaningful operational consideration in production environments where socket exposure is a security concern. The project documentation suggests using a Docker socket proxy to limit the API surface. Persistent storage for the Delta catalog, saved flows, and internal state is backed by local volumes, so you are responsible for backup, replication, and disaster recovery. Alembic handles database migrations automatically on startup, but managing the SQLite/Postgres metadata database and its schema evolution over time is your responsibility.
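After bringing the stack up, a quick reachability check of the two backend ports mentioned above can save some debugging time. The sketch below uses only the Python standard library; the host address and the helper names are assumptions, and it only tests that something accepts TCP connections on each port, not that the service is healthy.

```python
import socket

# Ports as stated in the text; the host is an assumption for a local deployment.
SERVICES = {"core": 63578, "worker": 63579}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "127.0.0.1", services=SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

If the services run inside Docker Compose, the ports must also be published to the host for this check to succeed from outside the containers.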

There is currently no hosted cloud offering, managed service tier, or enterprise support contract for Flowfile — it is a single-developer open-source project with active community contributions. This means you get the full feature set with no artificial limits, but also no SLA, no guaranteed upgrade path, and no official support channel beyond GitHub Issues and Discussions. The feature roadmap explicitly notes that multi-user collaboration and role-based access control are not yet implemented, which limits suitability for teams sharing a single instance in production without additional access controls applied at the infrastructure level.

Repository Health

Pre-computed score based on development activity, maintenance, community, maturity, and trend momentum.

82/100 Excellent

Development Activity: 100

Maintenance: 100

Community: 48

Maturity: 40

Momentum: 40

Very active development
Well-maintained with consistent updates
Rapidly growing project

Technical Analysis

81/100 Excellent

Architecture: 82

Code Quality: 72

Innovation: 88

Learning Curve: 80

Repository Stats

Contributors

Total Commits: 647

Monthly Commits

Watchers

Repo Age: 1.7 years

Last Commit: 2 days ago

Built With

Python: 65.4%

Vue: 20.8%

TypeScript: 12.2%

Recent Releases

69 total

~3.5 releases/month

Alternative To

Artie, Matillion

Topics

drag-and-drop electron-app etl etl-pipeline polars python visual-programming vue

Related Apps

C++

70%

Apache 2.0

ClickHouse

Databases · Analytics · Data Engineering

48,457

Open-source column-oriented database that delivers real-time analytical queries on petabyte-scale data with millisecond latency.

Apache Airflow

Data Engineering

46,032

Define, schedule, and monitor complex data workflows as Python code — with a powerful UI, 80+ provider integrations, and battle-tested scalability across thousands of production deployments.

Label Studio

AI Development · Data Engineering

27,735

Label Studio is an open-source, multi-type data labeling platform that lets teams annotate images, text, audio, video, and time series data with a configurable XML-based UI and export annotations in formats ready for any ML framework.
