Give an AI agent a real LLM training setup and let it experiment autonomously overnight — you wake up to a log of experiments and (hopefully) a better model.
autoresearch is a minimal, single-GPU autonomous research framework created by Andrej Karpathy that hands the wheel of LLM pretraining to an AI agent. Instead of a researcher manually tweaking model architecture, optimizer settings, and hyperparameters, an AI agent reads a Markdown instruction file, modifies the training script, runs a 5-minute experiment, checks whether the metric improved, and then either keeps or discards the change — repeating indefinitely until manually stopped.
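The keep-or-discard cycle described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `Ledger`, `run_loop`, and the `propose_and_train` callback are hypothetical names, and the loop is bounded by `max_experiments` only so the sketch terminates (the real agent runs until stopped).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ledger:
    """Tracks the best metric so far and the outcome of every attempt."""
    best_bpb: float
    history: List[str] = field(default_factory=list)


def run_loop(
    propose_and_train: Callable[[int], Optional[float]],
    ledger: Ledger,
    max_experiments: int,
) -> Ledger:
    """Keep a change only if it lowers val_bpb; otherwise discard it.

    propose_and_train is a hypothetical stand-in for "edit the training
    script, run the 5-minute experiment, return val_bpb". It returns None
    when the run crashed.
    """
    for i in range(max_experiments):
        bpb = propose_and_train(i)
        if bpb is not None and bpb < ledger.best_bpb:
            ledger.best_bpb = bpb          # improvement: keep the change
            ledger.history.append("keep")
        else:
            ledger.history.append("discard")  # worse, equal, or crashed
    return ledger
```

Lower val_bpb is better, so a change is kept only on a strict improvement; ties and crashed runs are treated as discards in this sketch.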
The core design philosophy is radical minimalism: the entire codebase is three files. A fixed preparation script handles data downloading and tokenizer training and is never touched. A single training script — containing the full GPT model definition, the Muon+AdamW optimizer, and the training loop — is the only file the agent is allowed to modify. A Markdown program file gives the agent its research goals and behavioral instructions. This tight scope keeps experiments reviewable as diffs and metrics directly comparable.
The evaluation metric is validation bits-per-byte (val_bpb), a vocabulary-size-independent measure that allows fair comparison regardless of architectural changes the agent might try. Training always runs for exactly 5 minutes of wall-clock time, giving roughly 12 experiments per hour and around 100 experiments during a typical overnight run. The dataset is Karpathy’s climbmix-400b-shuffle, a large-scale pretraining corpus streamed from Hugging Face.
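To see why bits-per-byte does not depend on vocabulary size, it helps to write the metric out: the summed cross-entropy (in nats) is converted to bits and divided by the number of raw bytes the evaluated tokens cover, not by the number of tokens. The function below is a sketch of that standard definition; the names are illustrative, and details of the project's locked evaluation harness (for example, how special tokens are handled) are not shown in this article.

```python
import math
from typing import Sequence


def bits_per_byte(
    token_nll_nats: Sequence[float],
    token_byte_lengths: Sequence[int],
) -> float:
    """Total loss in bits divided by total bytes of the evaluated text.

    token_nll_nats: per-token negative log-likelihood, in nats.
    token_byte_lengths: number of UTF-8 bytes each token decodes to.
    """
    total_nats = sum(token_nll_nats)
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("no bytes to evaluate")
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * total_bytes)
```

A tokenizer with a larger vocabulary produces fewer tokens with higher loss per token, but the byte count in the denominator stays the same, so the two effects are measured on a common scale.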
Under the hood, the default training script includes a state-of-the-art baseline: Flash Attention 3 with sliding window patterns, RMSNorm, Rotary Position Embeddings (RoPE), value residuals (ResFormer-style), the Muon optimizer with Polar Express orthogonalization, NorMuon variance reduction, and cautious weight decay. The agent is free to experiment with all of these. The project attracted massive community interest and has spawned community forks targeting macOS, Windows, and AMD GPUs.
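For readers unfamiliar with the orthogonalization step in Muon, the idea is to replace a gradient matrix with the nearest orthogonal matrix using only matrix multiplications. The Polar Express coefficients are not given in this article, so the sketch below substitutes the classic cubic Newton-Schulz iteration instead. It is written in pure Python for small, full-rank matrices purely as an illustration, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Uses the classic cubic iteration X <- 1.5 X - 0.5 X X^T X, which is
    NOT the Polar Express polynomial; it is a simpler stand-in.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    if norm == 0.0:
        raise ValueError("cannot orthogonalize the zero matrix")
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x
```

Each step pushes every nonzero singular value toward 1 while leaving the singular vectors unchanged, which is why the result is orthogonal for full-rank input.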
Architecture
The system follows a tight two-layer separation of concerns: a fixed substrate layer that cannot be modified by the agent, and a mutable experiment layer that is entirely the agent’s domain. The fixed layer consists of data loading with best-fit packing, BPE tokenization, and a locked evaluation harness. The mutable layer is a single Python file containing the full GPT model, optimizer, and training loop. An agent instruction file written in Markdown acts as the “research org program”: a human-authored document that sets goals, constraints, and behavioral rules for the AI agent. The branching strategy (one git branch per experimental session, commits for every attempt, resets on failure) provides a natural experiment ledger without requiring any custom tooling.
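The git-as-ledger idea can be made concrete by listing the commands one attempt would involve. The sketch below only builds the command lists and does not execute them. The helper names, the branch name, and the exact choice of `git commit -am` followed by `git reset --hard HEAD~1` are assumptions about how "commit every attempt, reset on failure" might be carried out, not the project's documented procedure.

```python
from typing import List


def start_session(branch: str) -> List[str]:
    """Command that opens a new branch for one experimental session."""
    return ["git", "checkout", "-b", branch]


def record_attempt(message: str, improved: bool) -> List[List[str]]:
    """Commands for one attempt: always commit, then roll back if it failed.

    The commit stays on the branch when the metric improved; otherwise the
    branch is reset to the previous commit so the next attempt starts clean.
    """
    commands = [["git", "commit", "-am", message]]
    if not improved:
        commands.append(["git", "reset", "--hard", "HEAD~1"])
    return commands
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it; keeping the planning separate from execution makes the policy easy to inspect.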
Tech Stack
The project runs on Python 3.10+ with PyTorch 2.9 as the sole deep learning framework, managed by the uv package manager. Flash Attention 3 is loaded at runtime via the kernels package, which selects between a Hopper-only implementation (varunneal/flash-attention-3) and one that supports a broader range of GPUs (kernels-community/flash-attn3) based on the detected GPU capability. The BPE tokenizer is trained using rustbpe (a Rust-backed implementation) and wrapped with tiktoken for encoding. Training data is stored as Parquet shards downloaded from Hugging Face and read via pyarrow. The Muon optimizer is implemented from scratch in a fused torch.compile kernel using the Polar Express polynomial approximation for matrix orthogonalization, combined with NorMuon variance reduction and cautious weight decay. All model and optimizer steps are compiled with torch.compile using dynamic=False for maximum throughput.
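The capability-based choice of Flash Attention 3 implementation amounts to a one-line decision. The sketch below takes the compute capability as a plain tuple so it runs without a GPU; in practice the tuple would come from `torch.cuda.get_device_capability()`. The two repository names are the ones mentioned above, while the function name and the exact condition (Hopper means compute capability 9.0) are assumptions about how the selection might be written.

```python
from typing import Tuple

HOPPER_CAPABILITY = (9, 0)  # compute capability of Hopper GPUs such as the H100


def pick_flash_attn_repo(capability: Tuple[int, int]) -> str:
    """Choose which Flash Attention 3 kernel repository to load.

    Assumption: the Hopper-only build is used exactly on capability 9.0,
    and every other GPU gets the broader community build.
    """
    if capability == HOPPER_CAPABILITY:
        return "varunneal/flash-attention-3"
    return "kernels-community/flash-attn3"
```

Loading the chosen kernel through the kernels package is left out of the sketch.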
Code Quality
The codebase is intentionally small: approximately 630 lines of training code and 390 lines of preparation utilities. There are no automated tests, no CI configuration, and no linter setup, which is consistent with its nature as a research prototype rather than production software. Code quality in the traditional sense is secondary to clarity of design intent: each section is delineated with comment banners, constants are grouped and annotated, and the data flow from raw parquet files through tokenization to GPU tensors is straightforward to trace. Error handling is minimal but appropriate: fast-fail on NaN or exploding loss, retry logic in the data downloader, and assertion guards on tensor shapes. The fused optimizer kernels rely on scalar CPU tensors to avoid torch.compile recompilations as hyperparameters change, which is a notably careful implementation detail.
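The fast-fail behavior on a bad loss is simple to picture. The following is a sketch under stated assumptions: the exception class, the function name, and the `max_loss` threshold of 100.0 are all hypothetical, since the article does not say what cutoff the training script uses for an "exploding" loss.

```python
import math


class TrainingDiverged(RuntimeError):
    """Raised to abort a run whose loss is no longer meaningful."""


def check_loss(loss: float, step: int, max_loss: float = 100.0) -> None:
    """Abort immediately on NaN, infinite, or implausibly large loss.

    max_loss is an illustrative threshold, not the project's value.
    """
    if math.isnan(loss) or math.isinf(loss) or loss > max_loss:
        raise TrainingDiverged(f"step {step}: loss {loss} is not usable, aborting run")
```

Failing on the first bad step matters in a 5-minute budget: a diverged run would otherwise burn the rest of its time slot producing nothing the agent can compare.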
What Makes It Unique
The fundamental novelty is the inversion of the human-machine research loop: instead of a human running experiments and an AI assisting, the AI runs experiments autonomously and the human programs the research org via a Markdown file. The fixed time budget design is particularly clever: by normalizing every experiment to exactly 5 minutes, the agent can fairly compare models of wildly different sizes, architectures, and batch sizes without needing to reason about training efficiency. Because val_bpb is independent of vocabulary size, results stay comparable across whatever architectural changes the agent tries. The inclusion of a state-of-the-art baseline (Flash Attention 3, Muon with Polar Express, value residuals, NorMuon) means the agent starts from a highly competitive position rather than a toy setup, making the research results genuinely meaningful.
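A wall-clock budget changes the shape of the training loop: it stops on elapsed time rather than on a step count, so a larger or slower model simply completes fewer steps. The sketch below illustrates that idea. `train_for_budget`, `step_fn`, and the injectable `clock` parameter are hypothetical names, and this version can overshoot the budget by up to one step, which the real script may handle differently.

```python
import time
from typing import Callable


def train_for_budget(
    step_fn: Callable[[], None],
    budget_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run training steps until the wall-clock budget is used up.

    Returns the number of steps completed. The clock is a parameter so
    the loop can be exercised with a fake clock instead of waiting.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps
```

With the default budget of 300 seconds, back-to-back runs give the roughly 12 experiments per hour quoted earlier, before accounting for startup and evaluation time.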
autoresearch is released under the MIT License, which is one of the most permissive open-source licenses available. You can use it commercially, modify it freely, and distribute your changes, and there is no copyleft obligation: your own training scripts and research results remain entirely your own property. The only requirement is preserving the copyright and permission notice in copies of the software.
Running autoresearch yourself requires a single NVIDIA GPU — the code was tested on an H100 and expects CUDA support for Flash Attention 3. You are responsible for provisioning the GPU instance, managing the Python environment via uv, downloading the training data shards from Hugging Face (which can be several gigabytes), and keeping the agent running without interruption for the duration of your research session. There is no persistent server, no job queue, and no monitoring infrastructure — the agent runs as a foreground process inside your coding assistant, and if it crashes or the terminal closes, the session ends.
There is no hosted or SaaS version of autoresearch, no paid tier, and no managed service. What you get is entirely what is in the repository. This means no support SLA, no managed data storage, no automated experiment tracking beyond the local TSV file, and no collaboration features for running experiments across multiple GPUs or team members. If you need distributed training, reproducible experiment management at scale, or team-based research workflows, you would need to integrate this approach with additional tooling such as Weights & Biases, Ray, or a job scheduler — none of which are provided here.
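Since the local TSV file is the only experiment record, it is worth seeing how little code such a log needs. The sketch below appends one row per experiment using the standard library. The column names and the `append_result` helper are hypothetical; the article does not describe the actual file layout.

```python
import csv
from pathlib import Path
from typing import Sequence

# Hypothetical schema; the real file's columns are not documented here.
COLUMNS = ("experiment", "val_bpb", "status", "description")


def append_result(path: Path, row: Sequence[str]) -> None:
    """Append one experiment row, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow(row)
```

A file like this loads directly into a spreadsheet or a dataframe, which is one way to bridge to external tracking tools if they are needed later.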