# autoloop: autoresearch for Everything
Karpathy's autoresearch got 53K stars in a month. The idea: point an AI agent at a system you want to improve, give it a metric, let it run 100 experiments overnight. Wake up to a better system.
The loop was hardcoded to ML training. autoloop generalizes it to any domain.
## What It Does
autoloop runs a tight experiment loop on any file you want to improve:
- Agent reads your `program.md` (research goals) and the current file
- Agent proposes and applies one modification
- Your metric function evaluates the result
- If the score improved → keep it, git commit
- If not → discard, restore previous version
- Repeat N times

That's it. The loop handles everything else: backups, git history, logging, rollback.
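Conceptually, the whole loop fits in a dozen lines. Here is a simplified sketch (not autoloop's actual source; `propose_edit` stands in for the agent call, and plain file copies stand in for the git-backed backup/commit logic):

```python
import shutil


def experiment_loop(target, metric, propose_edit, n, higher_is_better=True):
    """Simplified sketch of autoloop's accept/reject loop."""
    best = metric(target)
    for _ in range(n):
        shutil.copy(target, target + ".bak")      # back up before editing
        propose_edit(target)                      # agent applies one modification
        score = metric(target)                    # evaluate the result
        improved = score > best if higher_is_better else score < best
        if improved:
            best = score                          # keep it (autoloop also git-commits)
        else:
            shutil.copy(target + ".bak", target)  # discard: restore previous version
    return best
```

The asymmetry is the point: a bad experiment costs nothing but one evaluation, because the restore step makes every change reversible.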
## Setup: 3 Minutes

```bash
pip install autoloop-ai
# Note: PyPI package is `autoloop-ai` (the name `autoloop` was already taken
# by an unrelated project). The import is still `from autoloop import ...`

# Pick your backend (one of these):
pip install anthropic   # Anthropic API
pip install openai      # OpenAI API
# OR install Ollama locally for zero cost: https://ollama.com
```
Three files you need:
1. The file you want to improve: whatever the agent will edit. Could be a Python function, a SQL query, a system prompt, a config file.
2. `program.md`: your research goals in plain English:

```markdown
# Research Directives

## Goal
Improve the system prompt to increase task accuracy on customer support queries.

## Hypotheses to explore
- More specific role definition
- Explicit tone guidance
- Edge case handling instructions

## Constraints
- Keep under 400 tokens
- Must not make specific promises
```
3. Your run script:
```python
from autoloop import AutoLoop, AnthropicBackend

# Define your metric: must return a float
def my_metric(target_path: str) -> float:
    # Evaluate the file however makes sense for your use case
    # Return higher = better (or set higher_is_better=False)
    score = run_my_eval(target_path)
    return score

loop = AutoLoop(
    target="system_prompt.md",      # file to optimize
    metric=my_metric,               # your eval function
    directives="program.md",        # research goals
    backend=AnthropicBackend(       # your LLM
        model="claude-sonnet-4-5"   # or claude-opus-4-5, etc.
    ),
    higher_is_better=True,
)

loop.run(experiments=50)
# check autoloop-results/ when done
```
## Bring Your Own Key
autoloop works with whatever LLM you have access to:
```python
# Anthropic (Claude)
from autoloop import AnthropicBackend
backend = AnthropicBackend(model="claude-sonnet-4-5")
# Reads ANTHROPIC_API_KEY from env, or pass api_key= directly

# OpenAI (GPT-4o, o3, etc.)
from autoloop import OpenAIBackend
backend = OpenAIBackend(model="gpt-4o")
# Reads OPENAI_API_KEY from env, or pass api_key= directly

# Local: no API key, no cost
from autoloop import OllamaBackend
backend = OllamaBackend(model="llama3.1:8b")
# Requires Ollama running locally: ollama pull llama3.1:8b

# Claude Code CLI
from autoloop import ClaudeBackend
backend = ClaudeBackend()

# OpenAI Codex CLI
from autoloop import CodexBackend
backend = CodexBackend()
```
Set your key as an env var (recommended) or pass it directly:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
python3 run.py
```
## Real Test Results
We ran autoloop on a naive recursive `fibonacci` implementation with a timing + correctness metric: four experiments, no human involvement.
```
Baseline: -0.172s (naive recursion, fibonacci(30))

Exp 1: Add memoization with dict cache
  ✅ KEPT      | -0.025s | 6.9x faster
Exp 2: Switch to iterative approach
  ❌ DISCARDED | -0.028s | slower than memoized
Exp 3: Wrong shortcut (returns 999)
  ❌ DISCARDED | -999.0  | correctness check failed
Exp 4: Use functools.lru_cache
  ✅ KEPT      | -0.022s | marginal improvement over exp 1
```
The loop correctly kept improvements, discarded regressions, and rejected broken code, all automatically, based purely on the metric.
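For reference, the final kept version from experiment 4 would look roughly like this (a reconstruction of the `functools.lru_cache` approach the log describes, not the agent's verbatim output):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Cached recursion: each fibonacci(k) is computed only once."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```

The cache turns the naive O(2^n) recursion into O(n) work, which is why both memoization experiments beat the baseline by a wide margin.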
## What You Can Optimize
The pattern works for any domain with a quantifiable metric:
| Domain | Target file | Metric function |
|---|---|---|
| System prompts | `prompt.md` | LLM-as-judge score |
| Python functions | `utils.py` | Execution time / accuracy |
| SQL queries | `query.sql` | Query latency |
| Trading strategies | `strategy.py` | Sharpe ratio / win rate |
| RAG pipelines | `retrieval.py` | RAGAS / hit rate |
| Test suites | `tests.py` | Coverage score |
| API pipelines | `pipeline.py` | Latency / success rate |
The metric is the key design decision. It must be fast to evaluate (seconds, not minutes), unambiguous (one number, no judgment), and complete (rewards correctness, not just speed).
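A hand-rolled metric for the fibonacci run above shows the pattern: gate speed behind correctness so broken code can never win on latency. This is a sketch under our own conventions (the `-999.0` sentinel mirrors the score experiment 3 received; `fib_metric` is not part of autoloop):

```python
import subprocess
import sys
import time


def fib_metric(target_path: str) -> float:
    """Correctness first, then speed; runtime is negated so higher is better."""
    # Correctness gate: a wrong answer scores far below any timing result.
    check_snippet = (
        "import runpy; "
        f"mod = runpy.run_path({target_path!r}); "
        "assert mod['fibonacci'](30) == 832040"
    )
    check = subprocess.run([sys.executable, "-c", check_snippet], capture_output=True)
    if check.returncode != 0:
        return -999.0

    # Speed: time fibonacci(30) in a fresh process and negate the elapsed time.
    time_snippet = (
        "import runpy; "
        f"runpy.run_path({target_path!r})['fibonacci'](30)"
    )
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", time_snippet], capture_output=True)
    return -(time.perf_counter() - start)
```

Running each candidate in a fresh subprocess keeps a broken edit (infinite recursion, syntax error) from taking down the loop itself.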
## Built-in Metric Helpers

```python
from autoloop.metrics import LLMJudgeMetric, LatencyMetric, AccuracyMetric, CompositeMetric

# LLM rates the file against a rubric
metric = LLMJudgeMetric(rubric="Rate this prompt 0-1 on clarity and accuracy")

# Measures execution time (lower = better, automatically negated)
metric = LatencyMetric(command="python3 {target}", runs=3)

# Runs a test command, expects a float on the last line of output
metric = AccuracyMetric(test_command="python3 eval.py {target}")

# Combine multiple metrics with weights
metric = CompositeMetric([
    (accuracy_metric, 0.7),
    (latency_metric, 0.3),
])
```
## CLI

```bash
autoloop history    # full experiment log with scores
autoloop best       # show winning version
autoloop rollback   # restore best version to target file
```
## What Makes a Good `program.md`
The directives file is what you iterate on over time. The better it is, the better your results. Good directives:
- State a clear, measurable goal
- List specific hypotheses to explore
- Document what's already been tried (autoloop appends this automatically)
- Set hard constraints (token limits, must-pass tests, etc.)
Think of it as the research brief you'd give a smart junior engineer. The more specific, the better.
PyPI: `pip install autoloop-ai`. The name `autoloop` was already taken on PyPI by an unrelated package, so the package is `autoloop-ai`, but all imports remain `from autoloop import ...`.

Repo: github.com/menonpg/autoloop. MIT-licensed, contributions welcome.

Run the test suite yourself: `python3 tests/test_core.py` (no API key needed; uses a mock backend).

We also opened a PR to karpathy/autoresearch adding autoloop to the related-projects section. The loop design is directly inspired by autoresearch, so it felt right to close that loop.
## Community

If this is useful, a ⭐ helps others find it.