soul.py LoCoMo Benchmark Results Now on HuggingFace
We’ve published the full LoCoMo benchmark results for soul.py to HuggingFace: pgmenon/soul-benchmarks-locomo.
The dataset contains per-question scores across all 1,986 questions in the LoCoMo benchmark, tested against five retrieval configurations. Anyone can download it, reproduce our numbers, or use it as a baseline for their own memory system comparisons.
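As a rough sketch of how per-question results can be rolled up into per-configuration or per-category accuracy: the field names `config`, `category`, and `correct` below are assumptions for illustration, not the dataset's documented schema, so check the dataset card before relying on them. The rows are toy data, not real results.

```python
from collections import defaultdict

def accuracy_by_group(rows, key):
    """Return {group: percent correct} for per-question result rows.

    Each row is a dict and `key` names the field to group by. The field
    names used here ("config", "category", "correct") are hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        hits[group] += 1 if row["correct"] else 0
    return {g: 100.0 * hits[g] / totals[g] for g in totals}

# Toy rows standing in for downloaded per-question results.
rows = [
    {"config": "rlm", "category": "multi-hop", "correct": True},
    {"config": "rlm", "category": "temporal", "correct": False},
    {"config": "bm25", "category": "multi-hop", "correct": False},
    {"config": "bm25", "category": "temporal", "correct": False},
]

by_config = accuracy_by_group(rows, "config")
by_category = accuracy_by_group(rows, "category")
```

The same helper works for any grouping field, so a comparison against another memory system only needs its rows in the same shape.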
What We Benchmarked
LoCoMo is a benchmark from Snap Research designed to test long-conversation memory. It covers four question categories: single-hop recall, multi-hop reasoning, open-domain questions, and temporal reasoning. We ran all five soul.py retrieval modes against the full test set:
| Configuration | Overall | Single-Hop | Multi-Hop | Open-Domain | Temporal |
|---|---|---|---|---|---|
| RLM | 70.0% | 54.1% | 82.1% | 55.1% | 40.0% |
| Hybrid | 65.6% | — | — | — | — |
| Auto | 64.1% | — | — | — | — |
| Qdrant (RAG) | 63.4% | — | — | — | — |
| BM25 | 63.1% | — | — | — | — |
RLM (Reflective Layered Memory) is the clear winner internally, scoring 6.6 points above pure vector RAG (70.0% vs. 63.4%). The multi-hop score of 82.1% is particularly strong: questions that require synthesizing information across multiple conversation turns are exactly where layered reflection pays off.
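The margins are easy to check by hand. This short sketch takes the overall scores from the table above and computes each configuration's gap to the Qdrant (RAG) baseline in percentage points; the variable names are just for illustration.

```python
# Overall LoCoMo scores (percent) copied from the table above.
overall = {
    "RLM": 70.0,
    "Hybrid": 65.6,
    "Auto": 64.1,
    "Qdrant (RAG)": 63.4,
    "BM25": 63.1,
}

baseline = overall["Qdrant (RAG)"]

# Gap to the vector-RAG baseline in percentage points, rounded to one decimal.
gaps = {name: round(score - baseline, 1) for name, score in overall.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {gap:+.1f}")
```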
Honest Comparison With the Field
We’re not going to pretend these numbers make soul.py state-of-the-art. Here’s where things stand:
| System | Overall Score | Architecture |
|---|---|---|
| XMem | 91.5% | Multi-agent + structured extraction |
| Memobase | 75.8% | User profiling + structured memory |
| Zep | 75.1% | Knowledge graph + temporal |
| soul.py (RLM) | 70.0% | RAG + RLM, zero deps |
| Mem0 | 62.0% | Vector store |
| LangMem | 57.2% | LangChain memory |
| OpenAI Threads | 51.8% | Built-in context |
soul.py beats Mem0, LangMem, and OpenAI’s built-in memory, but trails Zep by 5.1 points, Memobase by 5.8, and XMem by 21.5. The gap is real, and it tells us something important about what’s missing.
What the Gap Tells Us
The systems above soul.py all share one feature: structured entity extraction. XMem extracts entities and relationships into a graph. Zep builds a knowledge graph with temporal edges. Memobase creates structured user profiles.
soul.py currently does none of that. All retrieval is over unstructured text, whether via BM25 keyword matching, vector similarity, or RLM’s reflective layers. The fact that we hit 70% without any structured extraction suggests the foundation is solid. But to close the gap, we need to add structure.
Our analysis with gbrain showed that adding a lightweight entity graph yielded +31 precision points on targeted queries. That’s the signal we’re following.
Multi-Hop Is Our Strength
The 82.1% multi-hop score deserves attention. Multi-hop questions require the system to find and connect information scattered across different parts of a conversation. For example, “What restaurant did they go to on the trip they planned in March?” requires linking a trip, a date, and a restaurant across potentially thousands of tokens of conversation.
RLM’s layered reflection architecture is purpose-built for this. By creating progressively higher abstractions over raw conversation data, it naturally connects related information that pure vector search would retrieve as disconnected chunks.
Explore the Data
- HuggingFace Dataset: pgmenon/soul-benchmarks-locomo — full per-question results, downloadable
- Interactive Dashboard: menonpg.github.io/soul-benchmarks — filter by category, compare configs
- Detailed Analysis: AI Memory Benchmark Showdown — deep dive into what the numbers mean
What’s Next
Three features on the roadmap directly target the gaps these benchmarks revealed:
- Lightweight Entity Graph (mode="graph"): Regex-based entity and relationship extraction on every memory write. Zero LLM cost, graph traversal for multi-hop and relationship queries. This is the single biggest lever for closing the gap to XMem/Zep.
- Cross-Encoder Reranking (mode="rerank"): A ~150ms post-retrieval pass that reshuffles results using full query-document attention. Should improve precision without adding infrastructure.
- Temporal Reasoning: Temporal edges in the entity graph for tracking state changes over time. Our weakest category at 40% vs XMem’s 91.9%.
All three are designed to work within soul.py’s zero-dependency philosophy — no external services, no Docker compose files, just pip install soul-agent and go.
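To make the first roadmap item concrete, here is a minimal sketch of regex-based entity extraction feeding a co-occurrence graph, with a breadth-first traversal for multi-hop lookups. It is not soul.py's implementation: the `EntityGraph` class, its methods, the capitalized-word regex, and the stopword list are all hypothetical, chosen only to illustrate the idea with the standard library.

```python
import re
from collections import defaultdict, deque

# Naive pattern: runs of capitalized words are treated as entities.
# Real extraction rules would need to be far more careful.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")
# Capitalized sentence starters that are not entities (illustrative only).
STOPWORDS = {"They", "The", "We", "It"}

class EntityGraph:
    """Hypothetical in-memory entity graph built at write time."""

    def __init__(self):
        self.edges = defaultdict(set)      # entity -> co-mentioned entities
        self.mentions = defaultdict(list)  # entity -> memories mentioning it

    def write(self, memory: str) -> None:
        entities = set(ENTITY_RE.findall(memory)) - STOPWORDS
        for entity in entities:
            self.mentions[entity].append(memory)
            self.edges[entity] |= entities - {entity}

    def neighbors_within(self, start: str, max_hops: int) -> set:
        """Breadth-first search for entities reachable in <= max_hops."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for nxt in self.edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return seen - {start}

graph = EntityGraph()
graph.write("Alice planned a trip to Lisbon in March.")
graph.write("They had dinner at Cervejaria Ramiro in Lisbon.")

one_hop = graph.neighbors_within("Alice", 1)
two_hops = graph.neighbors_within("Alice", 2)
```

In this toy example the restaurant is only reachable from "Alice" at two hops, via "Lisbon", which is the kind of link a multi-hop question depends on. Everything runs on the standard library, in keeping with the zero-dependency goal.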
Try It
pip install soul-agent
- GitHub: menonpg/soul.py
- PyPI: soul-agent
- arXiv: 2604.09588