# autoloop: autoresearch for Everything
Karpathy's autoresearch got 53K stars in a month. The idea: point an AI agent at a system you want to improve, give it a metric, let it run 100 experiments overnight. Wake up to a better system.
The loop was hardcoded to ML training. autoloop generalizes it to any domain.
## What It Does
autoloop runs a tight experiment loop on any file you want to improve:
- Agent reads your `program.md` (research goals) and the current file
- Agent proposes and applies one modification
- Your metric function evaluates the result
- If the score improved → keep it, git commit
- If not → discard, restore previous version
- Repeat N times

That's it. The loop handles everything else: backups, git history, logging, rollback.
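Conceptually, the whole loop fits in a dozen lines. Here is a simplified sketch (not autoloop's actual source; `propose_edit` stands in for the agent call, and plain file copies stand in for the git-backed backup/commit logic):

```python
import shutil


def experiment_loop(target, metric, propose_edit, n, higher_is_better=True):
    """Simplified sketch of autoloop's accept/reject loop."""
    best = metric(target)
    for _ in range(n):
        shutil.copy(target, target + ".bak")      # back up before editing
        propose_edit(target)                      # agent applies one modification
        score = metric(target)                    # evaluate the result
        improved = score > best if higher_is_better else score < best
        if improved:
            best = score                          # keep it (autoloop also git-commits)
        else:
            shutil.copy(target + ".bak", target)  # discard: restore previous version
    return best
```

The asymmetry is the point: a bad experiment costs nothing but one evaluation, because the restore step makes every change reversible.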
## Setup: 3 Minutes

```bash
pip install autoloop-ai
# Note: PyPI package is `autoloop-ai` (the name `autoloop` was already taken
# by an unrelated project). The import is still `from autoloop import ...`

# Pick your backend (one of these):
pip install anthropic   # Anthropic API
pip install openai      # OpenAI API
# OR install Ollama locally for zero cost: https://ollama.com
```
Three files you need:
1. The file you want to improve: whatever the agent will edit. Could be a Python function, a SQL query, a system prompt, a config file.
2. `program.md`: your research goals in plain English:

```markdown
# Research Directives

## Goal
Improve the system prompt to increase task accuracy on customer support queries.

## Hypotheses to explore
- More specific role definition
- Explicit tone guidance
- Edge case handling instructions

## Constraints
- Keep under 400 tokens
- Must not make specific promises
```
3. Your run script:
```python
from autoloop import AutoLoop, AnthropicBackend

# Define your metric: must return a float
def my_metric(target_path: str) -> float:
    # Evaluate the file however makes sense for your use case
    # Return higher = better (or set higher_is_better=False)
    score = run_my_eval(target_path)
    return score

loop = AutoLoop(
    target="system_prompt.md",      # file to optimize
    metric=my_metric,               # your eval function
    directives="program.md",        # research goals
    backend=AnthropicBackend(       # your LLM
        model="claude-sonnet-4-5"   # or claude-opus-4-5, etc.
    ),
    higher_is_better=True,
)

loop.run(experiments=50)
# check autoloop-results/ when done
```
## Bring Your Own Key
autoloop works with whatever LLM you have access to:
```python
# Anthropic (Claude)
from autoloop import AnthropicBackend
backend = AnthropicBackend(model="claude-sonnet-4-5")
# Reads ANTHROPIC_API_KEY from env, or pass api_key= directly

# OpenAI (GPT-4o, o3, etc.)
from autoloop import OpenAIBackend
backend = OpenAIBackend(model="gpt-4o")
# Reads OPENAI_API_KEY from env, or pass api_key= directly

# Local: no API key, no cost
from autoloop import OllamaBackend
backend = OllamaBackend(model="llama3.1:8b")
# Requires Ollama running locally: ollama pull llama3.1:8b

# Claude Code CLI
from autoloop import ClaudeBackend
backend = ClaudeBackend()

# OpenAI Codex CLI
from autoloop import CodexBackend
backend = CodexBackend()
```
Set your key as an env var (recommended) or pass it directly:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
python3 run.py
```
## Real Test Results
We ran autoloop on a naive recursive `fibonacci` implementation with a timing + correctness metric: four experiments, no human involvement.
```
Baseline: -0.172s (naive recursion, fibonacci(30))

Exp 1: Add memoization with dict cache
  ✅ KEPT      | -0.025s | 6.9x faster
Exp 2: Switch to iterative approach
  ❌ DISCARDED | -0.028s | slower than memoized
Exp 3: Wrong shortcut (returns 999)
  ❌ DISCARDED | -999.0  | correctness check failed
Exp 4: Use functools.lru_cache
  ✅ KEPT      | -0.022s | marginal improvement over exp 1
```
The loop correctly kept improvements, discarded regressions, and rejected broken code, all automatically, based purely on the metric.
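For reference, the final kept version from experiment 4 would look roughly like this (a reconstruction of the `functools.lru_cache` approach the log describes, not the agent's verbatim output):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Cached recursion: each fibonacci(k) is computed only once."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```

The cache turns the naive O(2^n) recursion into O(n) work, which is why both memoization experiments beat the baseline by a wide margin.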
## What You Can Optimize
The pattern works for any domain with a quantifiable metric:
| Domain | Target file | Metric function |
|---|---|---|
| System prompts | `prompt.md` | LLM-as-judge score |
| Python functions | `utils.py` | Execution time / accuracy |
| SQL queries | `query.sql` | Query latency |
| Trading strategies | `strategy.py` | Sharpe ratio / win rate |
| RAG pipelines | `retrieval.py` | RAGAS / hit rate |
| Test suites | `tests.py` | Coverage score |
| API pipelines | `pipeline.py` | Latency / success rate |
The metric is the key design decision. It must be fast to evaluate (seconds, not minutes), unambiguous (one number, no judgment), and complete (rewards correctness, not just speed).
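A hand-rolled metric for the fibonacci run above shows the pattern: gate speed behind correctness so broken code can never win on latency. This is a sketch under our own conventions (the `-999.0` sentinel mirrors the score experiment 3 received; `fib_metric` is not part of autoloop):

```python
import subprocess
import sys
import time


def fib_metric(target_path: str) -> float:
    """Correctness first, then speed; runtime is negated so higher is better."""
    # Correctness gate: a wrong answer scores far below any timing result.
    check_snippet = (
        "import runpy; "
        f"mod = runpy.run_path({target_path!r}); "
        "assert mod['fibonacci'](30) == 832040"
    )
    check = subprocess.run([sys.executable, "-c", check_snippet], capture_output=True)
    if check.returncode != 0:
        return -999.0

    # Speed: time fibonacci(30) in a fresh process and negate the elapsed time.
    time_snippet = (
        "import runpy; "
        f"runpy.run_path({target_path!r})['fibonacci'](30)"
    )
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", time_snippet], capture_output=True)
    return -(time.perf_counter() - start)
```

Running each candidate in a fresh subprocess keeps a broken edit (infinite recursion, syntax error) from taking down the loop itself.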
## Built-in Metric Helpers

```python
from autoloop.metrics import LLMJudgeMetric, LatencyMetric, AccuracyMetric, CompositeMetric

# LLM rates the file against a rubric
metric = LLMJudgeMetric(rubric="Rate this prompt 0-1 on clarity and accuracy")

# Measures execution time (lower = better, automatically negated)
metric = LatencyMetric(command="python3 {target}", runs=3)

# Runs a test command, expects a float on the last line of output
metric = AccuracyMetric(test_command="python3 eval.py {target}")

# Combine multiple metrics with weights
metric = CompositeMetric([
    (accuracy_metric, 0.7),
    (latency_metric, 0.3),
])
```
## CLI

```bash
autoloop history    # full experiment log with scores
autoloop best       # show winning version
autoloop rollback   # restore best version to target file
```
## What Makes a Good `program.md`
The directives file is what you iterate on over time. The better it is, the better your results. Good directives:
- State a clear, measurable goal
- List specific hypotheses to explore
- Document what's already been tried (autoloop appends this automatically)
- Set hard constraints (token limits, must-pass tests, etc.)
Think of it as the research brief you'd give a smart junior engineer. The more specific, the better.
PyPI: `pip install autoloop-ai`. The name `autoloop` was already taken on PyPI by an unrelated package, so the package is `autoloop-ai`, but all imports remain `from autoloop import ...`.

Repo: github.com/menonpg/autoloop. MIT-licensed, contributions welcome.

Run the test suite yourself: `python3 tests/test_core.py` (no API key needed; uses a mock backend).

We also opened a PR to karpathy/autoresearch adding autoloop to the related-projects section. The loop design is directly inspired by autoresearch, so it felt right to close that loop.
## Community

If this is useful, a ⭐ helps others find it.