OpenClaw-RL: Train Any Agent Just by Using It

By Prahlad Menon · 3 min read

Every time you correct an AI agent, re-ask a question it got wrong, or watch a test fail because of something it wrote — that moment contains exactly the information needed to make it better. Current RL systems throw that information away.

OpenClaw-RL — from the paper “Train Any Agent Simply by Talking” (arXiv:2603.10165) — captures it. Every interaction, every correction, every failure becomes a live training signal. The model updates itself in the background while you use it. No human labelers. No separate training runs. No pausing.

The signal most RL systems ignore

Think about a student who gets an exam back, looks at the grade, and throws the paper away without reading the teacher’s notes. Standard reinforcement learning does exactly this. It sees: action taken → outcome (success/failure) and records a scalar reward. The teacher’s notes — the specific correction, the explanation of what went wrong, the instruction for what to do differently — are discarded.

The next-state signal is what OpenClaw-RL recovers. After every agent action, something happens:

  • A user replies (and their reply contains implicit or explicit feedback)
  • A terminal outputs a result or an error
  • A GUI changes state
  • A test passes or fails
  • A user asks the same question again (strong signal: the first answer didn’t satisfy them)

These next-state signals have been sitting in every deployed agent system, untouched, because no existing framework knew how to extract them as training data. OpenClaw-RL does.

Two signals from every interaction

OpenClaw-RL extracts two distinct types of learning signal from each next-state:

1. Evaluative signals — did it work?

These answer: how well did the action perform?

  • User asks the same question again → negative signal (they weren’t satisfied)
  • User says “thanks, that’s exactly right” → positive signal
  • Test passes → positive signal
  • Error log fires → negative signal

A Process Reward Model (PRM) judge converts these into scalar rewards automatically — no human scorer needed.

2. Directive signals — how should it have been different?

These are the teacher’s notes. They answer: in what specific direction should the action change?

  • User correction: “No, I meant X not Y” → extract what X should have been
  • Error message: “TypeError: expected int, got str” → extract the specific fix needed
  • Test failure output → extract what the expected behavior was

Through Hindsight-Guided On-Policy Distillation (OPD), OpenClaw-RL extracts textual hints from the next state, constructs an enhanced teacher context, and provides token-level directional advantage supervision. This is richer than any scalar reward — instead of “that was wrong (−1),” the model gets “that was wrong, and specifically here’s the direction you should have gone.”

The async training loop

Three components run concurrently, with minimal coordination overhead:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Agent         │     │   PRM Judge     │     │   Trainer       │
│   serves live   │────▶│   evaluates     │────▶│   updates       │
│   requests      │     │   interactions  │     │   policy        │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         ▲                                               │
         └───────────────────────────────────────────────┘
                    updated policy weights

The agent never pauses. The judge never blocks the agent. The trainer never waits for a gap in traffic. Normal deployment is the training environment.
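The loop above can be sketched as three threads coupled only by queues. This toy version replaces the model, the PRM judge, and the gradient step with placeholders — the structure, not the contents, is the point:

```python
import queue
import threading
import time

interactions: "queue.Queue[str]" = queue.Queue()            # agent -> judge
scored: "queue.Queue[tuple[str, float]]" = queue.Queue()    # judge -> trainer
weights_version = [0]        # trainer bumps this; agent picks it up between requests
stop = threading.Event()

def agent_loop():
    """Serves live requests; never pauses for training."""
    for i in range(5):
        if stop.is_set():
            break
        interactions.put(f"interaction-{i}")  # record each interaction's next state
        time.sleep(0.01)

def judge_loop():
    """Scores interactions as they arrive; never blocks the agent."""
    while not stop.is_set():
        try:
            item = interactions.get(timeout=0.05)
        except queue.Empty:
            continue
        scored.put((item, 1.0))  # toy reward; the real judge is the PRM

def trainer_loop():
    """Updates the policy from scored interactions; never waits for a traffic gap."""
    while not stop.is_set():
        try:
            scored.get(timeout=0.05)
        except queue.Empty:
            continue
        weights_version[0] += 1  # stand-in for a gradient step + weight push

threads = [threading.Thread(target=f) for f in (agent_loop, judge_loop, trainer_loop)]
for t in threads:
    t.start()
time.sleep(0.5)
stop.set()
for t in threads:
    t.join()
```

Because the only shared state is the two queues and the weight version, no component ever holds a lock that another is waiting on.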

What it learns from

One policy, all interaction types simultaneously:

Interaction    Evaluative signal                                    Directive signal
Conversation   User re-query = failure; explicit thanks = success   User corrections, rephrasing
Terminal       Exit code 0 = success; error = failure               Error message content
GUI            Expected state reached = success                     Unexpected UI state diff
SWE tasks      Test pass/fail                                       Test output, stack traces
Tool calls     Tool returned expected result                        Tool error or unexpected output

The same infrastructure handles all of them. There’s no separate fine-tuning pipeline for “conversation agent” vs “coding agent” — the policy learns from every signal in the same loop.
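One way to picture that unification: a single hypothetical extractor that maps any interaction type onto the same (evaluative, directive) signal pair. The dispatch rules below are illustrative, not from the paper:

```python
def signals_from(kind: str, next_state: str) -> dict:
    """Map any interaction type to one evaluative scalar and one optional
    directive hint, so a single training loop can consume all of them."""
    if kind == "terminal":
        ok = next_state.startswith("exit 0")
        return {"evaluative": 1.0 if ok else -1.0,
                "directive": None if ok else next_state}   # error message content
    if kind == "swe":
        ok = "passed" in next_state
        return {"evaluative": 1.0 if ok else -1.0,
                "directive": None if ok else next_state}   # test output, stack trace
    if kind == "conversation":
        if "thanks" in next_state.lower():
            return {"evaluative": 1.0, "directive": None}
        return {"evaluative": -1.0, "directive": next_state}  # user's correction text
    raise ValueError(f"unknown interaction kind: {kind}")
```

Whatever produced the next state — a shell, a test runner, a human — the trainer sees the same shape of signal.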

How this connects to what we’ve been building

This paper lands at exactly the right moment for several threads we’ve been following:

Hindsight memory — Hindsight stores memories and builds mental models from experience. OpenClaw-RL goes further: it doesn’t just remember what happened, it actively trains the policy on it. The reflect() operation we’ve been planning for soul.py v3 is conceptually adjacent — synthesizing raw experience into improved behavior. OpenClaw-RL’s directive signals are the RL equivalent of that synthesis step.

Darwinian Evolver — Imbue’s system uses evolutionary selection to improve code over iterations. OpenClaw-RL uses RL to improve agent behavior over interactions. Same underlying idea — use the signal from each attempt to guide the next — applied to different problem domains (code optimization vs. general agent behavior).

RuVector — RuVector’s GNN improves retrieval quality over time as more data flows through it. OpenClaw-RL improves generation quality over time as more interactions flow through it. The self-improving infrastructure pattern is converging across every layer of the stack.

soul.py’s planned reflect() — The reflect() feature on our soul.py roadmap synthesizes raw memories into mental models. OpenClaw-RL’s directive signal extraction does something structurally similar — it synthesizes raw interaction outcomes into policy improvement signals. Different implementation, same core insight: the signal is in the feedback, not just the outcome.

The deployment shift

The traditional AI development cycle: collect data → label it → train → deploy → collect more data → repeat. Each step is expensive, slow, and disconnected from live user behavior.

OpenClaw-RL collapses this into: deploy → improve continuously from real interactions. The deployed model and the training model are the same model. The data pipeline is the deployment itself.

This is the direction the field is moving — away from periodic offline retraining toward continuous online learning from real-world signals. OpenClaw-RL is one of the cleanest implementations of this pattern we’ve seen, and it’s open-source.


Related: Hindsight — Agent Memory That Actually Learns · Darwinian Evolver — LLMs That Evolve Code · RuVector — The Vector DB That Gets Smarter Over Time · soul.py v2 — RAG+RLM Hybrid · RAG + RLM: The Complete Architecture