Karpathy's autoresearch: You Describe the Goal, the Agent Runs Science Overnight

By Prahlad Menon · 5 min read

Andrej Karpathy opens the autoresearch README with a bit of fiction that lands closer to prophecy than joke:

“One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of ‘group meeting’. That era is long gone.”

Then: “This repo is the story of how it all began.”

autoresearch — 53K stars in under a month — is a framework where you describe your research objectives in a Markdown file and let an AI agent run experiments autonomously overnight. You wake up to a git log of everything it tried, what worked, what didn’t, and a better model than you had when you went to sleep.

The Core Idea

The setup is deliberately minimal. Three files that matter:

  • prepare.py — fixed constants, one-time data prep, runtime utilities. The agent never touches this.
  • train.py — the full GPT model, optimizer (Muon + AdamW), and training loop. This is what the agent edits. Architecture, hyperparameters, optimizer, batch size — everything is fair game.
  • program.md — your instructions to the agent. This is what you write and iterate on. Karpathy calls it “programming the program.”

The training setup is a simplified single-GPU implementation of nanochat. Each experiment runs for a fixed 5-minute wall-clock time budget. The metric is val_bpb — validation bits per byte, lower is better, vocab-size-independent so architectural changes are fairly comparable.

Fixed time budget means you get approximately 12 experiments per hour, ~100 experiments overnight. Each one: modify train.py, train for 5 minutes, evaluate, keep or discard, repeat.
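The metric itself is a simple conversion from cross-entropy loss. A minimal sketch of the idea (the function name and bookkeeping here are illustrative, not the repo's actual code):

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats, over the whole
    validation set) into bits per byte of the raw validation text.

    Normalizing by bytes instead of tokens makes the number independent
    of vocabulary size, so runs with different tokenizers compare fairly.
    """
    return total_loss_nats / (total_bytes * math.log(2))
```

A run whose tokenizer produces fewer, larger tokens still pays for every byte of underlying text, which is what keeps architectural comparisons honest.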

What the Agent Actually Does

You spin up Claude Code, Codex, or any capable agent in the repo, point it at program.md, and prompt:

“Have a look at program.md and let’s kick off a new experiment.”

The agent reads your research directives, proposes a modification to train.py (a new attention variant, a different optimizer config, a batch size change, a positional encoding experiment), runs it, checks if val_bpb improved, commits if it did, discards if it didn’t, then proposes the next experiment.

The loop is entirely autonomous. Your involvement is in program.md — describing what you want the agent to explore, what hypotheses to prioritize, what constraints to respect. The better your program.md, the better your research org.
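The keep-or-discard loop is conceptually simple. A hypothetical sketch of its control flow, assuming train.py prints its final metric as `val_bpb=<value>` (the function names and output format are assumptions, not the repo's API):

```python
import subprocess

def parse_bpb(stdout: str) -> float:
    """Assume train.py prints 'val_bpb=<value>' as its last output line."""
    return float(stdout.strip().splitlines()[-1].split("=")[-1])

def research_loop(propose, run, best_bpb: float, n_experiments: int = 100) -> float:
    """Keep-or-discard loop: propose() edits train.py, run() trains for the
    fixed budget and returns val_bpb. Commit improvements, revert regressions."""
    for _ in range(n_experiments):
        propose()                          # agent modifies train.py
        bpb = run()                        # one fixed-budget training run
        if bpb < best_bpb:                 # lower is better: keep the change
            best_bpb = bpb
            subprocess.run(["git", "commit", "-am", f"val_bpb {bpb:.4f}"])
        else:                              # regression: throw the change away
            subprocess.run(["git", "checkout", "--", "train.py"])
    return best_bpb
```

Injecting `propose` and `run` as callables is just for clarity here; in practice the agent itself plays both roles, and the git history becomes the lab notebook.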

Why the Design Decisions Matter

Single file to modify. The agent only touches train.py. This keeps the scope manageable and the diffs reviewable, and prevents the agent from breaking unrelated infrastructure while exploring ideas.

Fixed time budget. This is the most important design decision. By making every experiment exactly 5 minutes regardless of what changed, experiments are directly comparable across architectural changes. A bigger model and a smaller model with faster training both get the same wall-clock budget — the metric captures which is genuinely better, not which runs faster on your specific hardware. It also means the agent can’t game the evaluation by picking configurations that train fast but generalize poorly.
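One way to picture a fixed wall-clock budget (a sketch under my own naming, not the repo's training loop; `step` stands in for one forward/backward/optimizer step):

```python
import time

def train_with_budget(step, budget_s: float = 300.0) -> int:
    """Run training steps until the wall-clock budget expires.

    The number of steps completed varies with model size and per-step
    speed, but every configuration gets exactly the same real time, so
    the val_bpb measured afterwards compares configurations fairly.
    """
    deadline = time.monotonic() + budget_s
    steps = 0
    while time.monotonic() < deadline:
        step()      # one training step; slower models simply get fewer
        steps += 1
    return steps
```

A fatter model trades fewer steps for more capacity per step; the budget makes that trade-off visible in the metric instead of hiding it in runtime.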

One metric, unambiguous. val_bpb goes down or it doesn’t. There’s no interpretation, no judgment call, no “well it’s better in some ways.” This is what enables full autonomy — the agent has a clear signal to optimize against.

This feedback loop design — unambiguous binary outcome, fixed evaluation budget, single file scope — is what makes autoresearch actually work rather than being another “autonomous agent” demo that requires constant human intervention to stay on track.

The Broader Principle: Feedback Loop Quality

The reason autoresearch works is the same reason Polymarket prediction markets are a better training ground for AI agents than equity markets: the feedback is fast, unambiguous, and frequent.

ML research traditionally has terrible feedback loops. A hypothesis takes weeks to test. Results are ambiguous. It’s hard to know if an improvement is real or noise. autoresearch collapses that to 5 minutes and a single number.

Wherever you can engineer a tight, unambiguous feedback loop, autonomous agents can run science at a pace humans can’t match. That’s the generalizable insight from this project — not the specific ML training setup, but the loop design.

Requirements and Setup

Single NVIDIA GPU (tested on H100). Python 3.10+. uv package manager.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# One-time data prep (~2 min)
uv run prepare.py

# Test a single manual experiment (~5 min)
uv run train.py

# Then hand it to your agent

Point Claude Code or Codex at the repo, disable all permissions except train.py edits, and prompt it to read program.md and start experimenting.

The program.md defaults included in the repo are intentionally minimal: a bare-bones baseline. Karpathy’s explicit point is that iterating on program.md over time is how you build a better “research org.” The code the agent writes is temporary; the research directives you write are the compounding asset.
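As a rough illustration of what you might grow it into (this is my own sketch, not the repo's default file), a program.md could read something like:

```markdown
# Research directives

## Objective
Minimize val_bpb within the fixed 5-minute training budget.

## Hypotheses to prioritize
- Compare rotary vs. learned positional encodings.
- Sweep the Muon learning rate over a small grid.

## Constraints
- Only edit train.py; never touch prepare.py.
- Commit only changes that improve val_bpb; revert everything else.
```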

What This Unlocks

The obvious application is ML research — faster architecture iteration, hyperparameter search, optimizer experiments. But the framework generalizes to any domain with a fast, quantitative evaluation metric.

The meta-insight from the README intro isn’t just rhetoric. The bottleneck in research has always been the human feedback loop: propose, implement, evaluate, repeat — with sleep and meetings in between. autoresearch removes the human from that loop for everything except goal-setting. That changes the economics of what’s explorable.

Repo: github.com/karpathy/autoresearch — 53K stars, March 2026