autoresearch

Give an AI agent a real LLM training setup and let it experiment autonomously overnight — you wake up to a log of experiments and (hopefully) a better model.

89.8K stars

13K forks

MIT License

Python

autoresearch is a minimal, single-GPU autonomous research framework created by Andrej Karpathy that hands the wheel of LLM pretraining to an AI agent. Instead of a researcher manually tweaking model architecture, optimizer settings, and hyperparameters, an AI agent reads a Markdown instruction file, modifies the training script, runs a 5-minute experiment, checks whether the metric improved, and then either keeps or discards the change — repeating indefinitely until manually stopped.
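The keep-or-discard cycle described above can be sketched as a greedy loop. This is a minimal illustration, not the project's code: `run_loop`, `propose`, and `evaluate` are hypothetical names, and in the real workflow "propose" would be the agent editing the training script while "evaluate" would be a 5-minute training run.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    best_bpb: float                      # lowest val_bpb seen so far
    history: list = field(default_factory=list)


def run_loop(propose, evaluate, baseline_bpb, n_experiments):
    """Greedy keep-or-discard loop: a change is kept only if val_bpb drops.

    `propose(i)` returns a description of the i-th change and
    `evaluate(change)` returns its val_bpb. Both are stand-ins for the
    agent's edit and the timed training run (assumptions, not real APIs).
    """
    state = ResearchState(best_bpb=baseline_bpb)
    for i in range(n_experiments):
        change = propose(i)
        bpb = evaluate(change)
        kept = bpb < state.best_bpb      # lower bits-per-byte is better
        if kept:
            state.best_bpb = bpb         # the kept change becomes the new baseline
        state.history.append((change, bpb, "keep" if kept else "discard"))
    return state
```

Because each kept change becomes the new baseline, later experiments are always measured against the best result so far rather than against the original script.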

The core design philosophy is radical minimalism: the entire codebase is three files. A fixed preparation script handles data downloading and tokenizer training and is never touched. A single training script — containing the full GPT model definition, the Muon+AdamW optimizer, and the training loop — is the only file the agent is allowed to modify. A Markdown program file gives the agent its research goals and behavioral instructions. This tight scope keeps experiments reviewable as diffs and metrics directly comparable.

The evaluation metric is validation bits-per-byte (val_bpb), a vocabulary-size-independent measure that allows fair comparison regardless of architectural changes the agent might try. Training always runs for exactly 5 minutes of wall-clock time, giving roughly 12 experiments per hour and around 100 experiments during a typical overnight run. The dataset is Karpathy’s climbmix-400b-shuffle, a large-scale pretraining corpus streamed from Hugging Face.
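To see why bits-per-byte is independent of vocabulary size, here is a small sketch of the usual definition: total cross-entropy (in nats) divided by ln 2 times the number of UTF-8 bytes covered. The function name and the toy numbers are illustrative assumptions, not taken from the repository's evaluation harness.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Convert per-token losses (nats) into bits per byte of underlying text."""
    total_nats = sum(token_nll_nats)
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize by")
    return total_nats / (math.log(2) * total_bytes)


# Toy example: the same 12 bytes of text under two hypothetical tokenizers.
# A coarse tokenizer yields 3 tokens of 4 bytes, each costing 2.0 nats.
coarse = bits_per_byte([2.0, 2.0, 2.0], [4, 4, 4])
# A fine tokenizer yields 6 tokens of 2 bytes, each costing 1.0 nats.
fine = bits_per_byte([1.0] * 6, [2] * 6)
print(coarse, fine)
```

The per-token losses differ by a factor of two, yet both models spend the same number of bits on each byte of text, which is the quantity val_bpb compares.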

Under the hood, the default training script includes a state-of-the-art baseline: Flash Attention 3 with sliding window patterns, RMSNorm, Rotary Position Embeddings (RoPE), value residuals (ResFormer-style), the Muon optimizer with Polar Express orthogonalization, NorMuon variance reduction, and cautious weight decay. The agent is free to experiment with all of these. The project attracted massive community interest and has spawned community forks targeting macOS, Windows, and AMD GPUs.

What You Get

A three-file minimal codebase: fixed data prep, agent-editable training script, and human-editable research program instructions
A state-of-the-art GPT baseline with Flash Attention 3, RoPE, RMSNorm, value residuals, and Muon optimizer out of the box
A fixed 5-minute wall-clock time budget that makes all experiments directly comparable regardless of architecture changes
Vocabulary-size-independent val_bpb evaluation metric so the agent can fairly compare models with different vocab sizes or tokenizers
Best-fit packing dataloader that fills every training sequence completely, with no padding waste
Git-branch-per-run workflow that keeps every experimental change reviewable as a diff
A TSV results log tracking commit hash, val_bpb, VRAM usage, keep/discard status, and experiment description
Community forks with platform support for macOS (MLX), Windows (RTX), and AMD GPUs
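The TSV results log listed above could look roughly like the following sketch. The column names, helper functions, and every value in the demo are hypothetical placeholders; the source only states which kinds of fields are tracked.

```python
import csv
import io

# Hypothetical column names; the source lists the fields but not their exact headers.
FIELDS = ["commit", "val_bpb", "vram_gb", "status", "description"]


def append_result(fh, commit, val_bpb, vram_gb, status, description):
    """Append one experiment as a tab-separated row."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerow([commit, f"{val_bpb:.4f}", f"{vram_gb:.1f}", status, description])


def best_kept(fh):
    """Return the kept row with the lowest val_bpb, or None if nothing was kept."""
    reader = csv.DictReader(fh, delimiter="\t")
    kept = [row for row in reader if row["status"] == "keep"]
    return min(kept, key=lambda row: float(row["val_bpb"])) if kept else None


# Demo with placeholder values (not real results).
log = io.StringIO()
log.write("\t".join(FIELDS) + "\n")
append_result(log, "aaaaaaa", 1.0000, 40.0, "keep", "baseline (placeholder)")
append_result(log, "bbbbbbb", 0.9900, 41.5, "keep", "hypothetical change A")
append_result(log, "ccccccc", 1.0200, 39.0, "discard", "hypothetical change B")
log.seek(0)
best = best_kept(log)
print(best["commit"], best["val_bpb"])
```

A flat TSV keeps the log readable in any editor and easy to sort by val_bpb after an overnight run.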

Common Use Cases

Overnight autonomous research: Leave an AI agent running while you sleep to explore around a hundred architecture and optimizer variants, then wake up to a results log
Hyperparameter search: Point the agent at your training setup to automatically tune learning rates, batch sizes, optimizer betas, warmup and cooldown schedules without manual intervention
Architecture ablations: Let the agent systematically test variants of attention window patterns, model depth, head dimensions, and residual connections to find the best configuration for your GPU
Optimizer research: Use the editable training script as a testbed for trying new optimizer ideas — the agent can modify the Muon and AdamW implementations and measure impact
Educational ML exploration: Study a high-quality single-file GPT implementation that includes modern techniques like Polar Express orthogonalization and NorMuon variance reduction
Custom pretraining research: Fork and adapt the research program Markdown file to direct agent experiments toward specific goals such as minimizing VRAM, maximizing throughput, or improving specific downstream tasks

Under The Hood

Architecture

The system follows a tight two-layer separation of concerns: a fixed substrate layer that cannot be modified by the agent, and a mutable experiment layer that is entirely the agent’s domain. The fixed layer consists of data loading with best-fit packing, BPE tokenization, and a locked evaluation harness. The mutable layer is a single Python file containing the full GPT model, optimizer, and training loop. An agent instruction file written in Markdown acts as the “research org program”: a human-authored document that sets goals, constraints, and behavioral rules for the AI agent. The branching strategy (one git branch per experimental session, commits for every attempt, resets on failure) provides a natural experiment ledger without requiring any custom tooling.
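As an illustration of best-fit packing, the sketch below fills one fixed-length row from a buffer of tokenized documents. The policy shown (take the longest document that still fits, and crop the shortest one when nothing fits) is an assumption made for this example; the source does not spell out the exact rule, and `pack_row` is a hypothetical name.

```python
def pack_row(buffer, row_len):
    """Fill one row of exactly `row_len` tokens from `buffer` (a list of token lists).

    Documents are removed from `buffer` as they are used. Assumed policy:
    best fit first, then crop to fill the tail so no padding is needed.
    """
    row = []
    while len(row) < row_len:
        remaining = row_len - len(row)
        # Candidates are documents that fit entirely in the remaining space.
        fits = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fits:
            # Best fit: the longest document that still fits.
            i = max(fits, key=lambda j: len(buffer[j]))
            row.extend(buffer.pop(i))
        elif buffer:
            # Nothing fits: crop the shortest document to fill the tail exactly.
            # The cropped remainder is dropped in this sketch.
            i = min(range(len(buffer)), key=lambda j: len(buffer[j]))
            row.extend(buffer.pop(i)[:remaining])
        else:
            raise ValueError("buffer exhausted before the row was full")
    return row


docs = [[1] * 5, [2] * 3, [3] * 4, [4] * 7]
print(pack_row(docs, 8))
```

Every row comes out exactly `row_len` tokens long, so no positions are spent on padding; the trade-off in this sketch is that the tail of a cropped document is discarded.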

Tech Stack

The project runs on Python 3.10+ with PyTorch 2.9 as the sole deep learning framework, managed by the uv package manager. Flash Attention 3 is loaded at runtime via the kernels package with automatic fallback between Hopper-only (varunneal/flash-attention-3) and broader GPU (kernels-community/flash-attn3) implementations based on detected GPU capability. The BPE tokenizer is trained using rustbpe (a Rust-backed implementation) and wrapped with tiktoken for encoding. Training data is stored as Parquet shards downloaded from Hugging Face, read via pyarrow. The Muon optimizer is implemented from scratch in a fused torch.compile kernel using Polar Express polynomial approximation for matrix orthogonalization, combined with NorMuon variance reduction and cautious weight decay. All model and optimizer steps are torch.compile’d with dynamic=False for maximum throughput.
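The capability-based fallback between the two Flash Attention 3 kernel repositories might be expressed as below. This is a sketch under assumptions: `pick_fa3_repo` is a hypothetical helper, and the rule "compute capability 9.0 means Hopper" is the only check shown. In a real run the tuple would typically come from `torch.cuda.get_device_capability()`.

```python
HOPPER_REPO = "varunneal/flash-attention-3"      # Hopper-only build named in the text
FALLBACK_REPO = "kernels-community/flash-attn3"  # broader GPU support named in the text


def pick_fa3_repo(capability):
    """Choose a kernel repository from a (major, minor) CUDA compute capability.

    Assumption for this sketch: only capability (9, 0), i.e. Hopper GPUs such
    as the H100, uses the Hopper-only build; everything else falls back.
    """
    major, minor = capability
    return HOPPER_REPO if (major, minor) == (9, 0) else FALLBACK_REPO


print(pick_fa3_repo((9, 0)))
print(pick_fa3_repo((8, 0)))
```

Keeping the decision in a pure function makes the selection easy to check without a GPU present.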

Code Quality

The codebase is intentionally small: approximately 630 lines of training code and 390 lines of preparation utilities. There are no automated tests, no CI configuration, and no linter setup, which is consistent with its nature as a research prototype rather than production software. Code quality in the traditional sense is secondary to clarity of design intent: each section is delineated with comment banners, constants are grouped and annotated, and the data flow from raw parquet files through tokenization to GPU tensors is straightforward to trace. Error handling is minimal but appropriate: fast-fail on NaN or exploding loss, retry logic in the data downloader, and assertion guards on tensor shapes. The fused optimizer kernels rely on scalar CPU tensors to avoid torch.compile recompilations as hyperparameters change, which is a notably careful implementation detail.
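A fast-fail guard of the kind mentioned above could be as small as the following sketch. The function name and the `max_loss` threshold are hypothetical; the source does not say what limit the training script uses.

```python
import math


def check_loss(loss, max_loss=100.0):
    """Return `loss` unchanged, or raise if it is NaN, infinite, or too large.

    `max_loss` is an illustrative threshold, not a value from the project.
    """
    if math.isnan(loss) or math.isinf(loss):
        raise FloatingPointError(f"non-finite loss: {loss}")
    if loss > max_loss:
        raise FloatingPointError(f"loss {loss} exceeded limit {max_loss}")
    return loss
```

Aborting on the first bad step matters more than usual here, because a diverged run would otherwise burn the rest of its 5-minute budget without producing a usable result.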

What Makes It Unique

The fundamental novelty is the inversion of the human-machine research loop: instead of a human running experiments and an AI assisting, the AI runs experiments autonomously and the human programs the research org via a Markdown file. The fixed time budget design is particularly clever: by normalizing every experiment to exactly 5 minutes, the agent can fairly compare models of wildly different sizes, architectures, and batch sizes without needing to reason about training efficiency. The choice of val_bpb as the metric removes vocabulary size as a confound, so results would stay comparable even if the tokenizer configuration changed. The inclusion of a state-of-the-art baseline (Flash Attention 3, Muon with Polar Express, value residuals, NorMuon) means the agent starts from a highly competitive position rather than a toy setup, making the research results genuinely meaningful.
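The fixed time budget can be pictured as a training loop that stops on elapsed wall-clock time rather than on a step count. This is a minimal sketch: `train_for_budget` and `step_fn` are hypothetical names, and the injectable `clock` parameter exists only so the loop can be exercised without waiting.

```python
import time


def train_for_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call `step_fn(step)` repeatedly until `budget_seconds` have elapsed.

    Returns the number of completed steps. A slower model simply completes
    fewer steps inside the same budget.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn(steps)
        steps += 1
    return steps


# Tiny demo: a no-op "training step" with a 0.05-second budget.
completed = train_for_budget(lambda step: None, budget_seconds=0.05)
print(f"completed {completed} steps")
```

With a 300-second budget, 3600 / 300 gives the upper bound of 12 runs per hour quoted earlier, before any startup or evaluation overhead is counted.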

Self-Hosting

autoresearch is released under the MIT License, one of the most permissive open-source licenses available. You can use it commercially, modify it freely, and redistribute your changes, and there is no copyleft obligation: your own training scripts and research results remain entirely your own. The only requirement is preserving the copyright and permission notice in copies of the software.

Running autoresearch yourself requires a single NVIDIA GPU — the code was tested on an H100 and expects CUDA support for Flash Attention 3. You are responsible for provisioning the GPU instance, managing the Python environment via uv, downloading the training data shards from Hugging Face (which can be several gigabytes), and keeping the agent running without interruption for the duration of your research session. There is no persistent server, no job queue, and no monitoring infrastructure — the agent runs as a foreground process inside your coding assistant, and if it crashes or the terminal closes, the session ends.

There is no hosted or SaaS version of autoresearch, no paid tier, and no managed service. What you get is entirely what is in the repository. This means no support SLA, no managed data storage, no automated experiment tracking beyond the local TSV file, and no collaboration features for running experiments across multiple GPUs or team members. If you need distributed training, reproducible experiment management at scale, or team-based research workflows, you would need to integrate this approach with additional tooling such as Weights & Biases, Ray, or a job scheduler — none of which are provided here.
