ai-toolkit

Name: ai-toolkit
Rating: 5 (11195 reviews)

An all-in-one open-source training suite for finetuning diffusion models—FLUX, SDXL, Wan video, and audio—on consumer GPU hardware via CLI or web UI.

11.2K stars

1.4K forks

MIT License

Python

AI Toolkit by Ostris is a comprehensive, community-driven training framework for finetuning modern diffusion models without requiring expensive cloud infrastructure. It supports an extensive roster of image, video, and audio models—from FLUX.1/2 and SDXL to Wan 2.x video models and ACE-Step audio—all from a single, unified codebase.

The toolkit ships both a command-line interface driven by YAML config files and a full web UI built on Next.js that lets you start, stop, and monitor training jobs in a browser. This dual-mode design means researchers can script headless training pipelines on remote servers while practitioners can point-and-click their way through LoRA training on a local workstation.

At its core, AI Toolkit wraps Hugging Face Diffusers and Accelerate with a plugin-style extension system. Built-in extensions cover the most common workflows out of the box—LoRA, DoRA, LyCORIS, concept sliders, dataset captioning, and full fine-tuning—while the extension API lets advanced users add custom training pipelines without forking the core.

The project is actively maintained with multiple commits per day and a thriving Discord community. Cloud GPU deployment is supported on RunPod, Modal, and the author’s own Ostris Cloud, making it straightforward to scale from a local 24 GB GPU to multi-node training when needed.

What You Get

Support for 30+ diffusion models including FLUX.1/2, SDXL, SD 1.5, Wan 2.1/2.2 video, Chroma, HiDream, LTX, and ACE-Step audio in a single installation
Multiple training methods in one toolkit: LoRA, DoRA, LyCORIS (LoCon, LoHa, LoKr), full fine-tuning, concept sliders, and reference adapters
A Next.js web UI running on port 8675 for launching, monitoring, and stopping jobs with optional auth token security
YAML-driven configuration with environment variable interpolation, 24+ annotated example configs covering common model/GPU combinations, and Modal/RunPod deployment templates
Built-in dataset tools including multi-model auto-captioners (Qwen3-VL, Ideogram 4, ACE-Step), a SuperTagger, and aspect-ratio bucketing for mixed-resolution datasets
Memory management utilities, 8-bit and Prodigy optimizers, quantization via TorchAO and bitsandbytes, and gradient checkpointing for training on GPUs with as little as 12 GB VRAM
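
To make the aspect-ratio bucketing item concrete, here is a minimal sketch of the general technique in plain Python. It is not the toolkit's own data loader: the bucket list and the function names nearest_bucket and group_by_bucket are assumptions made for illustration. Each image is assigned to the bucket whose aspect ratio is closest to its own, so that every batch can be drawn from a single resolution.

```python
from collections import defaultdict

# Hypothetical bucket resolutions (width, height), each near 1024 * 1024 pixels.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))


def group_by_bucket(sizes, buckets=BUCKETS):
    """Map each used bucket to the indices of the images assigned to it."""
    groups = defaultdict(list)
    for index, (width, height) in enumerate(sizes):
        groups[nearest_bucket(width, height, buckets)].append(index)
    return dict(groups)


# Four images with different shapes: square, landscape, portrait, 4:3.
sizes = [(2000, 2000), (1920, 1080), (1080, 1920), (1600, 1200)]
groups = group_by_bucket(sizes)
for bucket, indices in groups.items():
    print(bucket, indices)
```

A real loader would typically also resize and crop each image to its bucket's resolution; that step is left out of this sketch.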

Common Use Cases

LoRA finetuning on FLUX models — training a lightweight adapter to inject a custom subject, style, or concept into FLUX.1-dev or FLUX.2 with a small captioned image dataset
Video model finetuning — adapting Wan 2.1/2.2 or LTX-2 video diffusion models to a specific motion style or character via LoRA on a 24 GB GPU
Concept slider training — learning directional edits (e.g., age, expression, lighting) using the concept slider extension without collecting paired training data
Headless cloud training — running jobs on RunPod, Modal, or Ostris Cloud using pre-built templates and mounted volumes, monitored remotely via the web UI or terminal
Multi-model experimentation — switching between SDXL, HiDream, and Chroma training runs using separate YAML configs and the same codebase installation
Custom extension development — building and dropping in project-specific training pipelines via the extension API without modifying core toolkit files
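
The "lightweight adapter" in the first use case can be quantified with a short calculation. LoRA represents the update to a d_out x d_in weight matrix as the product of two low-rank factors, so a rank-r adapter adds r * (d_in + d_out) trainable parameters for that layer. The layer size and rank below are illustrative assumptions, not values taken from any particular model or toolkit config.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_param_count(d_in, d_out):
    """Parameters in the full weight matrix of the same linear layer (no bias)."""
    return d_in * d_out


# Illustrative values only: a square 3072-wide linear layer and a rank-16 adapter.
d_in = d_out = 3072
rank = 16

adapter = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)
print(f"adapter: {adapter:,} params, full layer: {full:,} params")
print(f"adapter is {adapter / full:.2%} of the layer")
```

Under these assumptions the adapter holds roughly one percent of the layer's parameters, which is why the trained file is small and the optimizer state is cheap compared with full fine-tuning.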

Under The Hood

Architecture

AI Toolkit is built around a layered job-process model: a top-level runner dispatches YAML-configured job objects (TrainJob, GenerateJob, ExtractJob, ModJob, ExtensionJob), each of which instantiates one or more process objects that carry the actual training logic. The extension system decouples training pipelines from the core: built-in extensions like SDTrainer, ConceptSlider, and Captioner live in extensions_built_in/ and are loaded dynamically, keeping the core job runner agnostic of model-specific logic. Data flows from YAML configs through a preprocessing layer that resolves environment variables and template tags, into typed config dataclasses, then into Hugging Face Accelerate-backed training loops. Multi-GPU distribution is delegated to Accelerate’s process group, so the core codebase stays single-process from the developer’s perspective. The extension API gives a well-defined hook surface, but the sheer breadth of supported models means individual model adapters (flux.py, wan21.py, etc.) carry substantial bespoke logic that does not always generalise, so the design is deep within each adapter rather than uniformly modular across them.
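
The flow described above (resolve the config, look up a process class by type, run it) can be sketched in a few lines of standard-library Python. This is a hedged illustration of the pattern, not the toolkit's code: PROCESS_REGISTRY, register_process, resolve_env, EchoTrainer, run_job, and the "${VAR}" syntax are all assumptions made for the example, and the real project's class names and interpolation rules may differ.

```python
import os

PROCESS_REGISTRY = {}  # hypothetical: process "type" string -> process class


def register_process(type_name):
    """Class decorator that adds a process class to the registry."""
    def decorator(cls):
        PROCESS_REGISTRY[type_name] = cls
        return cls
    return decorator


def resolve_env(value):
    """Recursively expand $VAR / ${VAR} references in a parsed config."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: resolve_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item) for item in value]
    return value


class BaseProcess:
    """Minimal process interface: hold a config, expose run()."""

    def __init__(self, config):
        self.config = config

    def run(self):
        raise NotImplementedError


@register_process("echo_trainer")
class EchoTrainer(BaseProcess):
    """Stand-in for a real training process; it only reports its settings."""

    def run(self):
        return f"{self.config['steps']} steps -> {self.config['output_dir']}"


def run_job(raw_config):
    """Resolve the config, then run each listed process in order."""
    config = resolve_env(raw_config)
    results = []
    for process_config in config["process"]:
        process_cls = PROCESS_REGISTRY[process_config["type"]]
        results.append(process_cls(process_config).run())
    return results


# In practice this dict would come from a parsed YAML file.
os.environ["DEMO_OUTPUT_DIR"] = "/tmp/demo_run"
demo_config = {
    "process": [
        {"type": "echo_trainer", "steps": 1000, "output_dir": "${DEMO_OUTPUT_DIR}/lora"},
    ]
}
print(run_job(demo_config))
```

The point of the registry is that the runner never imports a concrete trainer: adding a pipeline means registering one more class, which is the property the extension system is described as providing.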

Tech Stack

The runtime is Python 3.10+ with PyTorch as the tensor backend, targeting Nvidia CUDA (cu128 wheels) with experimental Apple Silicon support via MPS. Training orchestration uses Hugging Face Accelerate for distributed training and mixed-precision, Diffusers for model pipelines and schedulers, and PEFT for LoRA attachment. LoRA variants beyond standard PEFT are handled by LyCORIS-LoRA and in-house DoRA/iLoRA implementations. Quantization is covered by TorchAO and bitsandbytes. Optimizers include standard AdamW alongside Prodigy, Adafactor, and custom 8-bit Adam variants. Datasets are loaded through a custom bucketing data loader with Albumentations augmentations. The web UI is a Next.js 14 application with a Prisma ORM-backed SQLite job store and a Tailwind/Shadcn component library, served on port 8675. Config files use YAML via oyaml for key-order preservation, and safetensors is the standard checkpoint format.

Code Quality

The codebase has a moderate test footprint: a testing/ directory contains targeted integration-style tests for VAE cycle consistency, bucket dataloader behaviour, model load/save round-trips, and the LTX dataloader. There are no unit tests covering the extension system, optimizer wrappers, or config parsing. Error handling is functional but inconsistent: the main runner catches broad exceptions and optionally continues, and individual processes surface errors through an on_error hook, but many inner functions let exceptions propagate unguarded. Type annotations are present on public interfaces and config dataclasses but sparse inside training loops. There is no apparent formatter or linter configuration at the repo root (no pyproject.toml, no ruff/flake8 config), and CI is limited. Code inside model adapters tends to be dense and heavily commented, which aids readability but also reflects the inherent complexity of supporting dozens of model architectures without a shared abstraction layer.
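
The error-handling pattern mentioned above (a runner that catches broad exceptions, notifies a hook, and optionally continues) looks roughly like the sketch below. DemoJob, run_all, and the continue_on_error flag are hypothetical names chosen for illustration, and the hook is placed on the job object for brevity; the project's actual runner and hook signatures are not reproduced here.

```python
class DemoJob:
    """Hypothetical job that records any error passed to its on_error hook."""

    def __init__(self, name, should_fail=False):
        self.name = name
        self.should_fail = should_fail
        self.errors = []

    def run(self):
        if self.should_fail:
            raise RuntimeError(f"{self.name} failed")

    def on_error(self, exc):
        self.errors.append(str(exc))


def run_all(jobs, continue_on_error=False):
    """Run jobs in order; on failure, notify the job and optionally move on."""
    completed, failed = [], []
    for job in jobs:
        try:
            job.run()
            completed.append(job.name)
        except Exception as exc:  # deliberately broad, mirroring the pattern described
            job.on_error(exc)
            failed.append(job.name)
            if not continue_on_error:
                raise
    return completed, failed


jobs = [DemoJob("a"), DemoJob("b", should_fail=True), DemoJob("c")]
completed, failed = run_all(jobs, continue_on_error=True)
print("completed:", completed, "failed:", failed)
```

The trade-off is the one the assessment points at: a broad catch keeps a batch of jobs running, but anything that fails below the hook level is only as visible as the hook makes it.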

What Makes It Unique

What distinguishes AI Toolkit from alternatives like Kohya-ss or SimpleTuner is its breadth of model support combined with first-class extension composability and a genuinely usable web UI shipped in the same repository. The concept slider training methodology, which learns directional edit vectors rather than subject-specific LoRAs, is a novel workflow that originated with Ostris and remains one of the toolkit’s signature capabilities. The project also maintains its own FLUX-derivative model (Flex.1/Flex.2) and contributes custom adapter architectures (iLoRA, LoRAFormer, mean-flow adapters) that extend beyond what upstream Diffusers ships. The tight coupling between model development and training tooling means new model architectures often land in AI Toolkit before any other training framework.

Self-Hosting

AI Toolkit is released under the MIT License, one of the most permissive open-source licenses available. You are free to use it commercially, modify the source code, redistribute it, and integrate it into proprietary products without any copyleft obligations. The only requirement is that you retain the copyright and permission notice in all copies or substantial portions of the software. The toolkit's license places no restrictions on using trained model outputs commercially, though the licenses of the base models you finetune (e.g., FLUX.1-dev’s non-commercial license) are a separate concern that you must evaluate independently.

Running AI Toolkit yourself requires Python 3.10+ and at least one Nvidia GPU with sufficient VRAM for your target model: typically 24 GB for FLUX LoRA training, though quantization and the memory management utilities can push that lower. You are responsible for provisioning the hardware, managing GPU drivers and CUDA versions, installing dependencies from the pinned requirements files, and keeping up with upstream changes to Diffusers and Transformers, which the project tracks closely. Beyond a basic Dockerfile, the codebase ships no production deployment tooling; operational concerns like persistent storage, job queuing across multiple GPUs, and fault tolerance fall entirely on the operator.
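
A back-of-the-envelope calculation shows why quantization matters for the VRAM figures above. The sketch below estimates the memory taken by model weights alone at different precisions. The 12-billion parameter count is an assumption based on how the FLUX.1-dev transformer is publicly described, and the estimate deliberately ignores text encoders, the VAE, activations, adapter gradients, and optimizer state, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Assumed, approximate size of the base transformer being adapted.
ASSUMED_PARAMS = 12e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(ASSUMED_PARAMS, precision):.1f} GB")
```

Under these assumptions the bf16 weights alone would fill a 24 GB card, which is consistent with 8-bit or lower quantization of the frozen base model being the main lever for fitting LoRA training onto smaller GPUs.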

The project itself has no paid tier, though the author operates Ostris Cloud, a first-party GPU rental service that directly funds development. Compared to managed platforms like Replicate or Hugging Face AutoTrain, self-hosting gives you full control over training hyperparameters, custom extension code, and model output storage, but you give up managed queuing, automatic scaling, dataset hosting, and dedicated support. The project’s primary community support channel is a public Discord server, and there is no commercial SLA or enterprise support offering.