soul.py LoCoMo Benchmark Results Now on HuggingFace

By Prahlad Menon | Published 2026-05-19 | 3 min read

We’ve published the full LoCoMo benchmark results for soul.py to HuggingFace: pgmenon/soul-benchmarks-locomo.

The dataset contains per-question scores across all 1,986 questions in the LoCoMo benchmark, tested against five retrieval configurations. Anyone can download it, reproduce our numbers, or use it as a baseline for their own memory system comparisons.
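As a rough illustration of how per-question results roll up into the kind of per-category accuracy figures reported in this post, here is a minimal sketch. The record fields (`config`, `category`, `score`) and the sample rows are assumptions made for illustration; check the dataset card for the actual column names before adapting it.

```python
from collections import defaultdict

# Hypothetical per-question records. Field names and values are
# illustrative only and may not match the published dataset schema.
records = [
    {"config": "rlm", "category": "multi_hop", "score": 1.0},
    {"config": "rlm", "category": "multi_hop", "score": 0.0},
    {"config": "rlm", "category": "temporal", "score": 1.0},
    {"config": "bm25", "category": "multi_hop", "score": 0.0},
    {"config": "bm25", "category": "temporal", "score": 1.0},
]

def accuracy_by_group(rows, keys):
    """Return the mean score, as a percentage, for each combination of `keys`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [score sum, question count]
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group][0] += row["score"]
        totals[group][1] += 1
    return {group: 100.0 * s / n for group, (s, n) in totals.items()}

per_category = accuracy_by_group(records, ["config", "category"])
per_config = accuracy_by_group(records, ["config"])

for group, pct in sorted(per_category.items()):
    print(group, f"{pct:.1f}%")
```

Grouping by a tuple of keys keeps the same function usable for both the overall column and the per-category columns.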

What We Benchmarked

LoCoMo is a benchmark from Snap Research designed to test long-conversation memory. It covers four question categories: single-hop recall, multi-hop reasoning, open-domain questions, and temporal reasoning. We ran all five soul.py retrieval modes against the full test set:

Configuration	Overall	Single-Hop	Multi-Hop	Open-Domain	Temporal
RLM	70.0%	54.1%	82.1%	55.1%	40.0%
Hybrid	65.6%	—	—	—	—
Auto	64.1%	—	—	—	—
Qdrant (RAG)	63.4%	—	—	—	—
BM25	63.1%	—	—	—	—

RLM (Reflective Layered Memory) is the clear winner internally, scoring 6.6 points above pure vector RAG (70.0% vs 63.4%). The multi-hop score of 82.1% is particularly strong: questions that require synthesizing information across multiple conversation turns are exactly where layered reflection pays off.
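For readers who want the margins between configurations spelled out, the short sketch below recomputes each mode's difference from the vector RAG baseline using only the overall scores in the table above. The helper name `deltas_vs` is ours, introduced for illustration.

```python
# Overall scores copied from the table above (percent of questions correct).
overall_scores = {
    "RLM": 70.0,
    "Hybrid": 65.6,
    "Auto": 64.1,
    "Qdrant (RAG)": 63.4,
    "BM25": 63.1,
}

def deltas_vs(baseline, table):
    """Return each other configuration's score minus the baseline's, in points."""
    base = table[baseline]
    return {
        name: round(score - base, 1)
        for name, score in table.items()
        if name != baseline
    }

deltas = deltas_vs("Qdrant (RAG)", overall_scores)
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {delta:+.1f} points")
```

Rounding to one decimal place matches the precision of the published table and avoids floating-point noise in the differences.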

Honest Comparison With the Field

We’re not going to pretend these numbers make soul.py state-of-the-art. Here’s where things stand:

System	Overall Score	Architecture
XMem	91.5%	Multi-agent + structured extraction
Memobase	75.8%	User profiling + structured memory
Zep	75.1%	Knowledge graph + temporal
soul.py (RLM)	70.0%	RAG + RLM, zero deps
Mem0	62.0%	Vector store
LangMem	57.2%	LangChain memory
OpenAI Threads	51.8%	Built-in context

soul.py beats Mem0, LangMem, and OpenAI’s built-in memory, but trails Zep by 5.1 points, Memobase by 5.8, and XMem by 21.5. The gap is real, and it tells us something important about what’s missing.

What the Gap Tells Us

The systems above soul.py all share one feature: structured entity extraction. XMem extracts entities and relationships into a graph. Zep builds a knowledge graph with temporal edges. Memobase creates structured user profiles.

soul.py currently does none of that. All retrieval is over unstructured text, whether through BM25 keyword matching, vector similarity, or RLM’s reflective layers. The fact that we hit 70% without any structured extraction suggests the foundation is solid. But to close the gap, we need to add structure.

Our analysis with gbrain showed that adding a lightweight entity graph yielded +31 precision points on targeted queries. That’s the signal we’re following.

Multi-Hop Is Our Strength

The 82.1% multi-hop score deserves attention. Multi-hop questions require the system to find and connect information scattered across different parts of a conversation. For example, “What restaurant did they go to on the trip they planned in March?” requires linking a trip, a date, and a restaurant across potentially thousands of tokens of conversation.

RLM’s layered reflection architecture is purpose-built for this. By creating progressively higher abstractions over raw conversation data, it naturally connects related information that pure vector search would retrieve as disconnected chunks.

Explore the Data

HuggingFace Dataset: pgmenon/soul-benchmarks-locomo — full per-question results, downloadable
Interactive Dashboard: menonpg.github.io/soul-benchmarks — filter by category, compare configs
Detailed Analysis: AI Memory Benchmark Showdown — deep dive into what the numbers mean

What’s Next

Three features on the roadmap directly target the gaps these benchmarks revealed:

Lightweight Entity Graph (mode="graph") — Regex-based entity and relationship extraction on every memory write. Zero LLM cost, graph traversal for multi-hop and relationship queries. This is the single biggest lever for closing the gap to XMem/Zep.
Cross-Encoder Reranking (mode="rerank") — A ~150ms post-retrieval pass that reshuffles results using full query-document attention. Should improve precision without adding infrastructure.
Temporal Reasoning — Temporal edges in the entity graph for tracking state changes over time. Our weakest category at 40% vs XMem’s 91.9%.

All three are designed to work within soul.py’s zero-dependency philosophy: no external services, no Docker Compose files, just pip install soul-agent and go.
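To make the first roadmap item more concrete, here is a minimal sketch of regex-based entity extraction on write, followed by graph traversal for multi-hop lookups. This is not soul.py’s implementation: the `EntityGraph` class, its methods, and the capitalized-word pattern are all hypothetical, chosen only to show the general shape using the standard library.

```python
import re
from collections import defaultdict, deque

# Naive entity pattern: runs of capitalized words. This is an assumed
# stand-in for real extraction rules; it will also match ordinary
# capitalized sentence-initial words, which a real system must filter.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

class EntityGraph:
    def __init__(self):
        self.edges = defaultdict(set)      # entity -> co-mentioned entities
        self.mentions = defaultdict(list)  # entity -> memories mentioning it

    def write(self, text):
        """Extract entities from one memory and link every co-mentioned pair."""
        entities = set(ENTITY_RE.findall(text))
        for entity in entities:
            self.mentions[entity].append(text)
            self.edges[entity] |= entities - {entity}

    def neighbors_within(self, start, max_hops):
        """Breadth-first search: map each reachable entity to its hop distance."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand past the hop limit
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        del seen[start]
        return seen

graph = EntityGraph()
graph.write("Alice planned a trip to Lisbon in March.")
graph.write("Lisbon dinner was at Cervejaria Ramiro.")

one_hop = graph.neighbors_within("March", 1)
two_hop = graph.neighbors_within("March", 2)
print(one_hop)
print(two_hop)
```

In this toy example the restaurant is only reachable from "March" in two hops, via "Lisbon", which mirrors the trip, date, and restaurant linkage that multi-hop questions demand. No LLM call is involved at write time, consistent with the zero-LLM-cost goal stated above.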

Try It

pip install soul-agent

GitHub: menonpg/soul.py
PyPI: soul-agent
arXiv: 2604.09588