ART

Name: ART
Rating: 5 (10288 reviews)

Give your LLM agents on-the-job training—ART lets you apply GRPO reinforcement learning to any multi-step agentic workflow with minimal code changes.

10.3Kstars

931forks

Apache License 2.0

Python

View Source Visit Website

On This Page

ART (Agent Reinforcement Trainer) is an open-source Python framework by OpenPipe that brings reinforcement learning directly into multi-step LLM agent workflows. Rather than requiring practitioners to become RL researchers, ART wraps the complexity of GRPO training behind an ergonomic client/server architecture that slots into any existing Python application. You define your agent workflow, assign rewards at the end of each rollout, and ART handles the rest—gradient updates, LoRA checkpointing, and inference routing through vLLM.

The framework ships two execution modes: a local backend for teams with their own GPU infrastructure, and a serverless backend powered by W&B Training, which manages GPU provisioning, inference clusters, and checkpoint publishing automatically. Both expose an identical API so switching between them requires only a one-line backend swap. Out of the box, ART supports Qwen 3, Llama, Mistral, and any model compatible with vLLM and Unsloth.

Beyond vanilla GRPO, ART introduces RULER—a zero-labeled-data reward function that uses an LLM judge to relatively rank trajectories within each training group, removing the need for handcrafted reward engineering for most tasks. AutoRL extends this further by generating training inputs automatically, enabling training on tasks where no dataset exists at all. Integrations with LangGraph, MCP servers, W&B, Langfuse, and OpenPipe’s own observability platform provide production-grade debugging and monitoring throughout the training loop.

Since its release in March 2025 ART has accumulated over 10,000 GitHub stars and 58 releases, driven by demonstrated results—a Qwen 2.5 14B email-retrieval agent trained with ART outperformed OpenAI’s o3 on the benchmark the team designed.

What You Get

OpenAI-compatible client proxy — drop-in replacement for the OpenAI SDK that transparently routes completions through the training server and records every message into typed Trajectory objects without changing your existing agent code
GRPO training loop with LoRA — each batch of completed rollouts is grouped, rewards are compared within groups, and the model is updated via GRPO using Unsloth-accelerated LoRA fine-tuning, then immediately hot-swapped into the running vLLM inference engine
Dual backend — LocalBackend for on-premises GPU clusters with full vLLM management, and ServerlessBackend (W&B Training) for fully managed infrastructure at 40% lower cost with 28% faster training on shared production GPU clusters
RULER reward function — an LLM-as-judge system that ranks trajectories relative to each other within a group using any frontier model, eliminating the need for labeled training data or hand-crafted scalar reward functions
AutoRL zero-data training — automatically generates diverse task inputs using an LLM and evaluates outcomes with RULER, enabling training on completely new tasks without any pre-existing dataset
LangGraph and MCP integrations — wrap_rollout() and init_chat_model() adapters wire ART’s trajectory collection directly into LangGraph agent graphs, while the MCP module teaches models to use any MCP server’s tools through RL
Observability pipeline — built-in cost tracking per API call, W&B metrics logging, Langfuse tracing, and Parquet trajectory export for debugging reward distributions and training dynamics
SFT + RL pipeline — supports supervised fine-tuning warmup before RL, distillation from larger models, and combined SFT+RL training schedules with configurable gradient accumulation and checkpoint retention

Common Use Cases

Email and document research agents — training a smaller open-weight model (7B–27B) to search, retrieve, and synthesize information across large corpora to match or exceed the accuracy of frontier closed models
Tool-using API agents — fine-tuning models to reliably call sequences of tools in the right order and with correct arguments, using reward signals derived from end-task success rather than individual tool-call correctness
MCP server mastery — teaching a model to effectively use a specific MCP server’s tool surface through reinforcement learning, so it learns which tools to invoke in what sequence for a given request
Game-playing and puzzle-solving — training agents on deterministic environments (2048, Tic Tac Toe, Codenames, Temporal Clue) where the reward signal is unambiguous, validating convergence and exploring exploration/exploitation tradeoffs
Multi-step reasoning pipelines — improving LangGraph-based workflows that require chained reasoning steps, where GRPO can optimize the full trajectory end-to-end rather than just individual steps
Low-data domain specialization — using AutoRL to bootstrap training on proprietary enterprise tasks where no existing fine-tuning dataset exists, generating synthetic inputs and judging outputs with RULER

Under The Hood

Architecture ART is organized around a strict client/server boundary: the client library runs inside the user’s agent process and exposes an OpenAI-compatible chat completions proxy that records every system, user, and assistant message into typed Trajectory objects, while the server runs as a separate GPU-bearing process that owns the vLLM inference engine and the GRPO training loop. This separation means the agent code never directly touches the model weights or training state—it simply accumulates trajectories and assigns a scalar reward, then calls train() to hand off a batch. The Backend protocol abstraction allows the server to be either a local subprocess or a remote managed service (W&B Training) with zero changes to the agent code, making the architecture genuinely portable across infrastructure tiers.

Tech Stack The client library is pure Python 3.12 with OpenAI SDK, LiteLLM (for multi-provider LLM judge calls in RULER), Polars for trajectory data manipulation, and Pydantic for all data models. The backend server depends on PyTorch, the HuggingFace Transformers and PEFT stacks, TRL for GRPO loss computation, Unsloth for memory-efficient LoRA training, and vLLM for high-throughput inference. Megatron-LM support is available as an optional extra for multi-node training at scale. Build tooling uses uv with hatchling, ruff for linting, and ty for type checking. The project ships as the openpipe-art PyPI package with optional dependency groups separating client-only from full backend installs.

Code Quality ART demonstrates comprehensive testing with an extensive unit test suite covering trajectory manipulation, metrics taxonomy, GRPO preprocessing, LoRA weight merging, LangGraph integration, MCP client behavior, RULER metrics, and cost tracking—plus integration tests that exercise the local backend training pipeline end to end. The codebase uses full type annotations throughout, enforced by ty (Astral’s type checker) running in CI alongside ruff for linting and formatting. Error handling is explicit and typed via custom exception classes with structured tracebacks captured in PydanticException models. The Backend interface is defined as a Protocol, ensuring that LocalBackend and ServerlessBackend remain structurally compatible without inheritance coupling.

What Makes It Unique ART’s most distinctive technical contribution is making GRPO a drop-in capability for existing Python agent code rather than a research experiment requiring specialized infrastructure knowledge. RULER removes the labeled-data bottleneck that has historically made RL for agents impractical for most teams: by using relative LLM-judge ranking within trajectory groups, it exploits the same insight GRPO relies on (relative rewards within a group suffice) without requiring any reference answers. AutoRL compounds this by eliminating the training data requirement entirely. The serverless backend’s multiplexed shared-cluster design delivers economics previously unavailable to individual researchers or small teams, while the LangGraph and MCP integrations mean ART works with the agent frameworks developers are already using rather than requiring a rewrite.

Self-Hosting

ART is released under the Apache License 2.0, which is one of the most permissive open-source licenses available. You can use it commercially, modify it, distribute it, and embed it in proprietary products without any obligation to open-source your own code. There are no copyleft implications—your reward functions, agent workflows, and trained LoRA adapters remain entirely your intellectual property. The only requirements are that you preserve the copyright notice and license text when redistributing the library itself.

Running ART locally demands meaningful GPU resources: training even a 7B model with GRPO requires at least a single A100 or H100-class GPU for the backend server, and the vLLM inference engine needs additional VRAM for hot-swapping LoRA adapters between rollout and training phases. You are responsible for provisioning and maintaining these GPUs, managing vLLM process health, storing checkpoints durably, and handling CUDA errors or OOM conditions that arise from large batch sizes or long trajectories. The recommended setup uses Docker with NVIDIA container runtime, and the project provides SkyPilot configuration for spinning up cloud VMs, but the operational responsibility for uptime, scaling, and cost control rests entirely with you.

The W&B Training serverless backend (operated by Weights & Biases) offloads all of that operational burden: GPU provisioning, inference cluster health, checkpoint storage, and HA are managed for you. Pricing is consumption-based and benchmarked at 40% lower cost than running equivalent dedicated GPU instances, with 28% faster training due to multiplexed shared clusters. Every checkpoint is automatically published to W&B and available for immediate inference via W&B Inference. Teams evaluating self-hosting versus managed should weigh the full operational cost of GPU maintenance, monitoring, and on-call rotation against the serverless pricing—for most teams without existing ML infrastructure, the serverless path will have lower total cost of ownership despite higher nominal per-token pricing.

On This Page