Turn messy real-world dataframes into machine learning features — no manual wrangling required.
skrub (formerly dirty_cat) is a Python library purpose-built for the unglamorous reality of machine learning with tabular data: columns full of typos, inconsistent date formats, mixed types, and high-cardinality string categories that standard encoders can’t handle. Where scikit-learn pipelines stop, skrub picks up, providing a suite of specialized transformers, encoders, and joiners that convert raw, dirty dataframes directly into numeric features ready for any ML estimator.
At its core sits the TableVectorizer, a single transformer that automatically inspects each column’s dtype and applies the appropriate encoding — one-hot for low-cardinality categories, the novel GapEncoder or MinHashEncoder for dirty strings, DatetimeEncoder for temporal columns, and passthrough for numerics. Wrapping a scikit-learn estimator with tabular_pipeline() produces a complete, fit-predict pipeline with sensible preprocessing defaults in a single call.
skrub also ships a powerful fuzzy-join system (Joiner, AggJoiner, InterpolationJoiner) that merges dataframes on approximate string or datetime keys rather than requiring exact matches — critical when combining datasets from different sources where entity names are spelled differently. The library’s DataOps system goes further, enabling declarative computation graphs that embed hyperparameter search spaces directly into data preparation steps, so Optuna or scikit-learn’s cross-validation can explore preprocessing choices alongside model hyperparameters.
Built on top of scikit-learn’s estimator API, skrub components drop into existing pipelines without friction. It supports both pandas and Polars dataframes through an internal dispatch layer, and ships TableReport — a standalone HTML report with interactive plots, association heatmaps, and per-column distributions — for rapid data exploration before any modelling begins.
Call tabular_pipeline(HistGradientBoostingClassifier()) and get a cross-validated baseline without writing any custom preprocessing.
Architecture
skrub is organized as a flat package of composable, single-responsibility transformers built on top of scikit-learn’s BaseEstimator / TransformerMixin API. The core design pattern is a SingleColumnTransformer base class that encapsulates column-level logic, combined with an ApplyToCols meta-transformer that distributes a single-column transformer across all matching columns in a dataframe. TableVectorizer sits one level above, using a selector system and a dispatch layer (_dataframe) to route each column to the appropriate column-level transformer based on dtype. The DataOps system introduces a separate computation-graph abstraction on top of this stack: var() and X/y create symbolic nodes, and attribute accesses or method calls on those nodes construct a lazy DAG rather than executing immediately. The DAG is evaluated by SkrubLearner, which also interprets embedded choose_* nodes as hyperparameter spaces for Optuna or scikit-learn search.
Tech Stack
skrub is a pure Python 3.10+ library published on PyPI and conda-forge. Core numeric computation relies on NumPy and SciPy, while matrix factorization in GapEncoder uses scikit-learn’s NMF primitives and KMeans for initialization. The string encoding pipeline uses scikit-learn’s CountVectorizer and HashingVectorizer for n-gram extraction. The fuzzy-join system applies TF-IDF vectorization followed by nearest-neighbor search from scikit-learn. Polars support is implemented through a runtime-dispatched abstraction layer in _dataframe/_common.py that detects the dataframe library and routes function calls accordingly — pyarrow is an optional dependency. HTML reporting uses Jinja2 templates with a shadow DOM custom element (pure.css, embedded JavaScript) for notebook-safe rendering. Build tooling uses setuptools with setuptools-scm for version management from git tags; pixi manages the multi-environment CI matrix; ruff handles linting and code style; CircleCI runs the test suite.
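The TF-IDF plus nearest-neighbor idea behind the fuzzy join can be sketched with scikit-learn alone. This is an illustration of the technique under assumed settings (character n-grams of length 2 to 4, a single nearest neighbor), not skrub's actual implementation, and the company names are made up for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

# Hypothetical reference keys and their misspelled counterparts.
aux_keys = ["Microsoft Corporation", "Apple Inc.", "Alphabet Inc."]
main_keys = ["Microsft Corp", "Aple", "Alphabet"]

# Character n-gram counts (hashed) reweighted by TF-IDF.
encoder = make_pipeline(
    HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    TfidfTransformer(),
)
aux_vectors = encoder.fit_transform(aux_keys)
main_vectors = encoder.transform(main_keys)

# For each main key, look up the closest reference key in vector space.
neighbors = NearestNeighbors(n_neighbors=1).fit(aux_vectors)
distances, indices = neighbors.kneighbors(main_vectors)
matches = [aux_keys[i] for i in indices[:, 0]]

for key, match, dist in zip(main_keys, matches, distances[:, 0]):
    print(f"{key!r} -> {match!r} (distance {dist:.2f})")
```

The returned distances are what a threshold would be applied to when weak matches should be rejected rather than joined.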
Code Quality
skrub has extensive test coverage spanning 46 test files, one per module, covering unit tests, doctest integration (enabled globally in pytest config), and JavaScript Cypress end-to-end tests for the TableReport browser component. Pytest is configured with strict doctest mode (--doctest-modules) and treats all FutureWarning and DeprecationWarning as errors, enforcing forward-compatibility discipline. Type annotations are used selectively for public APIs. The codebase enforces consistent style with ruff (lines ≤88 chars, imports sorted, pyupgrade rules), and pre-commit hooks run ruff on every commit. CI runs across Python 3.10–3.14 with min-dependency pinned environments as well as nightly wheel environments to catch upstream breakage early. Documentation is thorough — full sphinx gallery with runnable examples, a CONTRIBUTING guide, API reference with numpydoc, and a vision statement explaining design philosophy.
What Makes It Unique
The GapEncoder is the library’s most technically distinctive contribution: a Gamma-Poisson NMF model that treats an input string as a bag of character n-grams and factorizes the resulting count matrix to learn latent topics. This produces dense, continuous embeddings for dirty categorical data (typos, abbreviations, free text) that capture morphological similarity without any predefined vocabulary or label-encoding assumptions, something standard sklearn encoders cannot do. The fuzzy-join system is equally uncommon: rather than requiring exact key alignment, Joiner embeds join keys from both tables via TF-IDF and nearest-neighbor search, enabling approximate entity matching as a fit-transform step inside a scikit-learn pipeline. The DataOps system is a third differentiator: expressing preprocessing choices as lazy computation graphs that interoperate with Optuna means the entire ML pipeline, data wrangling choices included, is searchable as a joint hyperparameter space, an architecture not found in vanilla scikit-learn or pandas-based tooling.
skrub is released under the BSD 3-Clause License, one of the most permissive open-source licenses available. You can use it commercially, modify it, distribute it, and incorporate it into proprietary products without any copyleft obligations. The only requirements are that the copyright notice and license text be preserved in distributions, and that the skrub authors’ names not be used to endorse derivative products. For most data science teams and organizations, this license imposes no practical restrictions.
Running skrub yourself is straightforward from an operational standpoint because it is a Python library rather than a long-running service — there are no servers to provision, databases to manage, or uptime SLAs to maintain. It installs via pip or conda into any Python 3.10+ environment and its runtime footprint is limited to process memory during training and inference. The main operational considerations are reproducibility (pinning the skrub version in requirements files) and compute resources for large datasets, since GapEncoder and fuzzy-join operations on millions of rows can be memory-intensive and benefit from multi-core parallelism via joblib.
There is no official managed or hosted version of skrub — it is purely a library. This means there are no enterprise support tiers, no SLAs, no managed upgrades, and no vendor-provided monitoring. Community support is available through GitHub Issues, GitHub Discussions, and a Discord server maintained by the core team. For teams that need guaranteed support response times or commercial warranties, a support contract with a third-party vendor familiar with the PyData ecosystem would be required, as the skrub project itself does not offer this.