skrub

Turn messy real-world dataframes into machine learning features — no manual wrangling required.

1.6K stars
261 forks
BSD 3-Clause License
Python

skrub (formerly dirty_cat) is a Python library purpose-built for the unglamorous reality of machine learning with tabular data: columns full of typos, inconsistent date formats, mixed types, and high-cardinality string categories that standard encoders can’t handle. Where scikit-learn pipelines stop, skrub picks up, providing a suite of specialized transformers, encoders, and joiners that convert raw, dirty dataframes directly into numeric features ready for any ML estimator.

At its core sits the TableVectorizer, a single transformer that automatically inspects each column’s dtype and applies the appropriate encoding — one-hot for low-cardinality categories, the novel GapEncoder or MinHashEncoder for dirty strings, DatetimeEncoder for temporal columns, and passthrough for numerics. Wrapping a scikit-learn estimator with tabular_pipeline() produces a complete, fit-predict pipeline with sensible preprocessing defaults in a single call.
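The per-column routing described above can be pictured with a small standard-library sketch. It is not skrub's implementation: the function name `choose_encoder`, the returned labels, and the threshold value are assumptions made for illustration, and the real TableVectorizer works on pandas or Polars columns rather than plain lists.

```python
from datetime import datetime


def choose_encoder(values, cardinality_threshold=3):
    """Pick an encoder label for one column (illustrative rules only)."""
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, datetime) for v in non_null):
        return "datetime"
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "passthrough"
    # Everything else is treated as text and split on the number of distinct values.
    n_unique = len({str(v) for v in non_null})
    return "one-hot" if n_unique <= cardinality_threshold else "string-encoder"


# A toy table stored as a dict of columns (hypothetical data).
table = {
    "salary": [52000.0, 61000.0, None, 48000.0, 75000.0],
    "hired": [datetime(2020, 1, 6), datetime(2021, 3, 15), datetime(2019, 7, 1),
              datetime(2022, 9, 12), datetime(2018, 2, 5)],
    "department": ["HR", "IT", "HR", "IT", "HR"],
    "job_title": ["Sr. Engineer", "Senior engineer", "Acct Manager",
                  "Account mgr", "Data analyst"],
}

plan = {name: choose_encoder(column) for name, column in table.items()}
print(plan)
```

The threshold of 3 is chosen only so that the five-row toy table exercises both categorical branches.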

skrub also ships a powerful fuzzy-join system (Joiner, AggJoiner, InterpolationJoiner) that merges dataframes on approximate string or datetime keys rather than requiring exact matches — critical when combining datasets from different sources where entity names are spelled differently. The library’s DataOps system goes further, enabling declarative computation graphs that embed hyperparameter search spaces directly into data preparation steps, so Optuna or scikit-learn’s cross-validation can explore preprocessing choices alongside model hyperparameters.
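To make the idea of joining on approximate keys concrete, here is a minimal standard-library sketch. It is a simplification and not skrub's Joiner: it compares raw character-trigram counts with cosine similarity and a brute-force search, and the names `fuzzy_left_join` and `min_similarity` as well as the sample rows are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=3):
    """Count character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def fuzzy_left_join(left, right, key, min_similarity=0.5):
    """Attach the most similar right-hand row to each left-hand row."""
    right_vectors = [(row, ngrams(row[key])) for row in right]
    joined = []
    for row in left:
        vector = ngrams(row[key])
        best_row, best_vector = max(right_vectors, key=lambda rv: cosine(vector, rv[1]))
        # Rows whose best candidate is too dissimilar stay unmatched.
        match = best_row if cosine(vector, best_vector) >= min_similarity else {}
        joined.append({**row, **{f"{k}_right": v for k, v in match.items()}})
    return joined


left = [{"company": "Acme Corporation", "revenue": 10},
        {"company": "Globex Inc", "revenue": 7}]
right = [{"company": "ACME Corp.", "country": "US"},
         {"company": "Globex, Inc.", "country": "FR"},
         {"company": "Initech", "country": "DE"}]

joined = fuzzy_left_join(left, right, key="company")
for row in joined:
    print(row)
```

Character n-grams tolerate abbreviations and punctuation differences, but they cannot link names that share no spelling at all; a similarity floor keeps such rows unmatched rather than forcing a bad pairing.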

Built on top of scikit-learn’s estimator API, skrub components drop into existing pipelines without friction. It supports both pandas and Polars dataframes through an internal dispatch layer, and ships TableReport — a standalone HTML report with interactive plots, association heatmaps, and per-column distributions — for rapid data exploration before any modelling begins.

What You Get

  • TableVectorizer — auto-detects column types and applies the right encoder (one-hot, string, datetime, passthrough) to every column in a single transformer, producing a fully numeric design matrix
  • GapEncoder and MinHashEncoder — probabilistic encoders for high-cardinality, dirty string columns that capture morphological similarity without requiring clean, consistent values
  • Fuzzy join system — Joiner, AggJoiner, MultiAggJoiner, and InterpolationJoiner for merging dataframes on approximate string and datetime keys rather than exact matches
  • DatetimeEncoder — expands datetime columns into interpretable cyclical and linear numeric features (year, month, day-of-week, hour, total seconds) fit for tree or linear models
  • DataOps computation graphs — declarative DAG system that lets you embed choose_from/choose_float/choose_int hyperparameter spaces into data preparation steps and optimize them with Optuna or scikit-learn cross-validation
  • TableReport — interactive standalone HTML report with per-column distributions, association heatmaps, null counts, and sample rows, openable in a browser or embedded in notebooks
  • tabular_pipeline() — one-liner that wraps any scikit-learn estimator with appropriate preprocessing (TableVectorizer, imputer, optional scaler) based on estimator type
  • Pandas and Polars support — internal dataframe dispatch layer ensures all transformers work with both pandas and Polars dataframes without code changes
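As a rough illustration of how a min-hash style encoder can turn dirty strings into fixed-length numeric vectors, the sketch below hashes character trigrams with several seeded hash functions and keeps the minimum per seed. It is a standard-library approximation of the general technique, not skrub's MinHashEncoder; the function names and the choice of BLAKE2 as the hash are assumptions.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    grams = {padded[i:i + n] for i in range(len(padded) - n + 1)}
    return grams or {padded}  # fall back to the padded string when it is too short


def seeded_hash(gram, seed):
    """64-bit hash of an n-gram, salted with the seed."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, n_components=64):
    """One minimum hash value per seed; similar strings share many minima."""
    grams = char_ngrams(text)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching components, an estimate of n-gram Jaccard overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


london = minhash_signature("London")
typo = minhash_signature("Londn")
paris = minhash_signature("Paris")
print(estimated_similarity(london, typo), estimated_similarity(london, paris))
```

Because the signature length is fixed, the encoding needs no fitted vocabulary, which is one reason min-hash approaches suit columns with many unseen or misspelled values.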

Common Use Cases

  • Rapid ML baseline on raw data — pass a CSV with mixed types, nulls, and string categories directly to tabular_pipeline(HistGradientBoostingClassifier()) and get a cross-validated baseline without writing any custom preprocessing
  • Entity resolution and fuzzy merging — join two company datasets where one uses ‘IBM Corp.’ and another ‘International Business Machines’ using Joiner to find approximate matches by embedding string similarity
  • Dirty categorical encoding — encode a ‘product_description’ column with thousands of unique values and misspellings using GapEncoder, which extracts latent topics from n-gram co-occurrence patterns
  • Automated hyperparameter search over preprocessing — use DataOps to express ‘try GapEncoder with 10 or 30 components’ as a searchable choice, then run ParamSearch to jointly optimize encoding and model parameters
  • Temporal feature engineering — feed raw datetime columns to DatetimeEncoder to expand them into sin/cos cyclical features plus linear time-since-epoch features without manual feature crafting
  • Pre-modelling data auditing — generate a TableReport for any dataframe to instantly see outlier distributions, high null columns, categorical cardinality, and inter-column associations before writing a single line of ML code
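The sin/cos expansion mentioned in the temporal feature engineering item can be sketched with the standard library alone. This is an illustration of the idea rather than DatetimeEncoder's actual output: the feature names are hypothetical, and naive datetimes are assumed to be in UTC.

```python
from datetime import datetime, timezone
from math import cos, pi, sin


def cyclical(value, period):
    """Map a periodic value onto the unit circle so both ends of the cycle meet."""
    angle = 2 * pi * value / period
    return sin(angle), cos(angle)


def encode_datetime(dt):
    """Expand one naive datetime (assumed UTC) into numeric features."""
    hour_sin, hour_cos = cyclical(dt.hour, 24)
    month_sin, month_cos = cyclical(dt.month - 1, 12)
    weekday_sin, weekday_cos = cyclical(dt.weekday(), 7)
    return {
        "year": dt.year,
        "total_seconds": dt.replace(tzinfo=timezone.utc).timestamp(),
        "hour_sin": hour_sin, "hour_cos": hour_cos,
        "month_sin": month_sin, "month_cos": month_cos,
        "weekday_sin": weekday_sin, "weekday_cos": weekday_cos,
    }


def hour_distance(a, b):
    """Euclidean distance between two encoded hours on the unit circle."""
    return ((a["hour_sin"] - b["hour_sin"]) ** 2
            + (a["hour_cos"] - b["hour_cos"]) ** 2) ** 0.5


late = encode_datetime(datetime(2024, 3, 1, 23, 0))
midnight = encode_datetime(datetime(2024, 3, 2, 0, 0))
noon = encode_datetime(datetime(2024, 3, 2, 12, 0))

# 23:00 and 00:00 end up close together, unlike with the raw hour values 23 and 0.
print(hour_distance(late, midnight) < hour_distance(midnight, noon))
```

The linear `total_seconds` feature preserves ordering and trend, while the paired sin/cos columns let a linear model see that the end of a cycle is adjacent to its start.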

Under The Hood

Architecture

skrub is organized as a flat package of composable, single-responsibility transformers built on top of scikit-learn’s BaseEstimator / TransformerMixin API. The core design pattern is a SingleColumnTransformer base class that encapsulates column-level logic, combined with an ApplyToCols meta-transformer that distributes a single-column transformer across all matching columns in a dataframe. TableVectorizer sits one level above, using a selector system and a dispatch layer (_dataframe) to route each column to the appropriate column-level transformer based on dtype. The DataOps system introduces a separate computation-graph abstraction on top of this stack: var() and X/y create symbolic nodes, and attribute accesses or method calls on those nodes construct a lazy DAG rather than executing immediately. The DAG is evaluated by SkrubLearner, which also interprets embedded choose_* nodes as hyperparameter spaces for Optuna or scikit-learn search.
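The lazy-graph idea can be illustrated with a toy standard-library sketch: operations are recorded as nodes, nothing runs until evaluation, and choice nodes expose a search space. The classes `Node`, `Var`, `Choice` and the helper `grid_search` are hypothetical names for this illustration and do not reproduce skrub's DataOps API.

```python
import itertools


class Node:
    """A deferred computation; nothing runs until evaluate() is called."""

    def __init__(self, func=None, *inputs):
        self.func = func
        self.inputs = inputs

    def apply(self, func, *extra):
        # Record the call as a new graph node instead of executing it.
        return Node(func, self, *extra)

    def evaluate(self, bindings):
        args = [x.evaluate(bindings) if isinstance(x, Node) else x for x in self.inputs]
        return self.func(*args)

    def choices(self):
        """Collect every choice reachable from this node."""
        found = {}
        for x in self.inputs:
            if isinstance(x, Node):
                found.update(x.choices())
        return found


class Var(Node):
    """A named placeholder whose value is supplied at evaluation time."""

    def __init__(self, name):
        super().__init__()
        self.name = name

    def evaluate(self, bindings):
        return bindings[self.name]


class Choice(Var):
    """A placeholder that also declares the values a search may try."""

    def __init__(self, name, options):
        super().__init__(name)
        self.options = list(options)

    def choices(self):
        return {self.name: self.options}


def grid_search(node, data, score):
    """Evaluate the graph for every combination of choices; keep the best."""
    space = node.choices()
    names = sorted(space)
    best = None
    for combo in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, combo))
        result = score(node.evaluate({**data, **params}))
        if best is None or result > best[0]:
            best = (result, params)
    return best


values = Var("values")
cap = Choice("cap", [10, 50, 100])
clipped = values.apply(lambda vs, c: [min(v, c) for v in vs], cap)
total = clipped.apply(sum)

# Search for the cap that brings the clipped total closest to 100.
best = grid_search(total, {"values": [5, 60, 200]}, score=lambda t: -abs(t - 100))
print(best)
```

Because the preprocessing choice lives inside the graph, the same search loop that tunes a model could also tune the data preparation, which is the property the DataOps description emphasizes.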

Tech Stack

skrub is a pure Python 3.10+ library published on PyPI and conda-forge. Core numeric computation relies on NumPy and SciPy, while matrix factorization in GapEncoder uses scikit-learn’s NMF primitives and KMeans for initialization. The string encoding pipeline uses scikit-learn’s CountVectorizer and HashingVectorizer for n-gram extraction. The fuzzy-join system applies TF-IDF vectorization followed by nearest-neighbor search from scikit-learn. Polars support is implemented through a runtime-dispatched abstraction layer in _dataframe/_common.py that detects the dataframe library and routes function calls accordingly; pyarrow is an optional dependency. HTML reporting uses Jinja2 templates with a shadow DOM custom element (pure.css, embedded JavaScript) for notebook-safe rendering. Build tooling uses setuptools with setuptools-scm for version management from git tags; pixi manages the multi-environment CI matrix; ruff handles linting and code style; CircleCI runs the test suite.
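One way to picture a runtime dispatch layer is a decorator that looks at the top-level module of an object's type and routes the call to a library-specific implementation. The sketch below is an assumption about the general pattern, not the code in _dataframe/_common.py; `dispatch`, `specialize`, and `column_names` are illustrative names, and a stand-in class replaces a real dataframe so the example runs without pandas or Polars installed.

```python
def dispatch(default):
    """Turn a function into a dispatcher keyed by the argument's library."""
    registry = {}

    def wrapper(obj, *args, **kwargs):
        # "pandas.core.frame" -> "pandas", "polars.dataframe.frame" -> "polars"
        library = type(obj).__module__.split(".")[0]
        implementation = registry.get(library, default)
        return implementation(obj, *args, **kwargs)

    def specialize(library):
        def register(func):
            registry[library] = func
            return func
        return register

    wrapper.specialize = specialize
    return wrapper


@dispatch
def column_names(df):
    # Fallback used when no library-specific implementation is registered.
    raise TypeError(f"unsupported dataframe type: {type(df).__name__}")


@column_names.specialize("pandas")
def _column_names_pandas(df):
    return list(df.columns)


@column_names.specialize("polars")
def _column_names_polars(df):
    return list(df.columns)


class FakePandasFrame:
    """Stand-in that only mimics where a pandas DataFrame class lives."""
    __module__ = "pandas.core.frame"
    columns = ("city", "price")


print(column_names(FakePandasFrame()))
```

Keying on the module name rather than on imported classes means neither library has to be imported for the dispatcher to be defined, which is useful when one of them is optional.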

Code Quality

skrub has extensive test coverage spanning 46 test files, one per module, covering unit tests, doctest integration (enabled globally in pytest config), and JavaScript Cypress end-to-end tests for the TableReport browser component. Pytest is configured with strict doctest mode (--doctest-modules) and treats all FutureWarning and DeprecationWarning as errors, enforcing forward-compatibility discipline. Type annotations are used selectively for public APIs. The codebase enforces consistent style with ruff (lines ≤88 chars, imports sorted, pyupgrade rules), and pre-commit hooks run ruff on every commit. CI runs across Python 3.10–3.14 with min-dependency pinned environments as well as nightly wheel environments to catch upstream breakage early. Documentation is thorough: a full sphinx gallery with runnable examples, a CONTRIBUTING guide, an API reference with numpydoc, and a vision statement explaining design philosophy.

What Makes It Unique

The GapEncoder is the library’s most technically distinctive contribution: a Gamma-Poisson NMF model that treats an input string as a bag of character n-grams and factorizes the resulting count matrix to learn latent topics. This produces dense, continuous embeddings for dirty categorical data (typos, abbreviations, free text) that capture morphological similarity without any predefined vocabulary or label-encoding assumptions, something standard scikit-learn encoders cannot do. The fuzzy-join system is equally uncommon: rather than requiring exact key alignment, Joiner embeds join keys from both tables via TF-IDF and nearest-neighbor search, enabling approximate entity matching as a fit-transform step inside a scikit-learn pipeline. The DataOps system is a third differentiator: expressing preprocessing choices as lazy computation graphs that interoperate with Optuna means the entire ML pipeline, data wrangling choices included, is searchable as a joint hyperparameter space, an architecture not found in vanilla scikit-learn or pandas-based tooling.
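The factorization idea behind GapEncoder can be sketched in pure Python as a plain non-negative matrix factorization of a trigram count matrix, fitted with the classic multiplicative updates for the KL (Poisson) loss. This is a deliberately reduced illustration: it omits the Gamma priors and online fitting of the real encoder, and all function names and the sample strings are assumptions.

```python
import random
from math import log


def ngram_counts(strings, n=3):
    """Build a dense strings-by-n-grams count matrix."""
    vocab, rows = {}, []
    for text in strings:
        padded = f" {text.lower()} "
        row = {}
        for i in range(len(padded) - n + 1):
            col = vocab.setdefault(padded[i:i + n], len(vocab))
            row[col] = row.get(col, 0) + 1
        rows.append(row)
    return [[row.get(col, 0) for col in range(len(vocab))] for row in rows]


def product(W, H, eps=1e-9):
    """Matrix product W @ H, shifted by eps to avoid division by zero."""
    return [[sum(w * H[k][j] for k, w in enumerate(w_row)) + eps
             for j in range(len(H[0]))] for w_row in W]


def kl_divergence(V, W, H):
    """Generalized KL divergence between V and W @ H."""
    return sum((v * log(v / wh) if v else 0.0) - v + wh
               for v_row, wh_row in zip(V, product(W, H))
               for v, wh in zip(v_row, wh_row))


def kl_nmf(V, n_topics=2, n_iter=100, seed=0, eps=1e-9):
    """Factorize V into non-negative W (activations) and H (topics)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_topics)]
    for _ in range(n_iter):
        WH = product(W, H, eps)
        for i in range(n):  # update activations with topics held fixed
            for k in range(n_topics):
                num = sum(H[k][j] * V[i][j] / WH[i][j] for j in range(m))
                W[i][k] *= num / (sum(H[k]) + eps)
        WH = product(W, H, eps)
        for k in range(n_topics):  # update topics with activations held fixed
            col_sum = sum(W[i][k] for i in range(n)) + eps
            for j in range(m):
                num = sum(W[i][k] * V[i][j] / WH[i][j] for i in range(n))
                H[k][j] *= num / col_sum
    return W, H


names = ["london", "londn", "paris", "pariss"]
V = ngram_counts(names)
W, H = kl_nmf(V)
for name, activations in zip(names, W):
    print(name, [round(a, 2) for a in activations])
```

Each row of `W` is the dense embedding of one string: its activation on every latent topic. Strings that share many trigrams tend to load on the same topics, which is how misspellings can end up near their clean counterparts.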

Self-Hosting

skrub is released under the BSD 3-Clause License, one of the most permissive open-source licenses available. You can use it commercially, modify it, distribute it, and incorporate it into proprietary products without any copyleft obligations. The only requirements are that the copyright notice and license text be preserved in distributions, and that the skrub authors’ names not be used to endorse derivative products. For most data science teams and organizations, this license imposes no practical restrictions.

Running skrub yourself is straightforward from an operational standpoint because it is a Python library rather than a long-running service — there are no servers to provision, databases to manage, or uptime SLAs to maintain. It installs via pip or conda into any Python 3.10+ environment and its runtime footprint is limited to process memory during training and inference. The main operational considerations are reproducibility (pinning the skrub version in requirements files) and compute resources for large datasets, since GapEncoder and fuzzy-join operations on millions of rows can be memory-intensive and benefit from multi-core parallelism via joblib.

There is no official managed or hosted version of skrub — it is purely a library. This means there are no enterprise support tiers, no SLAs, no managed upgrades, and no vendor-provided monitoring. Community support is available through GitHub Issues, GitHub Discussions, and a Discord server maintained by the core team. For teams that need guaranteed support response times or commercial warranties, a support contract with a third-party vendor familiar with the PyData ecosystem would be required, as the skrub project itself does not offer this.
