AI Memory Benchmark Showdown: How soul.py, gbrain, Xmem, and 6 Others Actually Compare
Your AI agent forgets everything the moment the conversation ends. You know this. I know this. Every framework claims to fix it. But which ones actually work?
I ran the numbers. Eight memory frameworks, one benchmark, and a deep dive into the most interesting retrieval architecture I’ve seen this year. Here’s what I found.
The Benchmark: LoCoMo
LoCoMo (Long Conversation Memory) from Snap Research is the standard benchmark for conversational AI memory. It throws 1,986 questions at your system across 10 long conversations, testing four capabilities:
- Single-hop recall — “What’s the user’s dog’s name?”
- Multi-hop reasoning — “What city does the user’s sister live in, and what restaurant did they mention there?”
- Open-domain knowledge — Integrating memory with world knowledge
- Temporal understanding — “When did the user change their mind about moving?”
Temporal is where frameworks go to die.
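One note on reading the scores that follow: an overall accuracy is usually question-weighted, so it need not equal the simple mean of the category columns. Here is a minimal sketch of that bookkeeping. The `(category, is_correct)` record shape is my own assumption for illustration, not LoCoMo's actual file format or the grading script used for these results.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs (hypothetical shape)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        correct[category] += int(ok)
    per_category = {c: correct[c] / totals[c] for c in totals}
    # Overall is weighted by how many questions each category contributes.
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall

# Toy graded answers, invented for illustration.
graded = [
    ("single_hop", True),
    ("single_hop", False),
    ("temporal", False),
    ("multi_hop", True),
]
per_category, overall = accuracy_by_category(graded)
print(per_category)  # {'single_hop': 0.5, 'temporal': 0.0, 'multi_hop': 1.0}
print(overall)       # 0.5
```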
The Scoreboard
| System | Overall | Single-hop | Multi-hop | Temporal | Architecture |
|---|---|---|---|---|---|
| Xmem | 91.5% | — | — | — | Structured extraction + graph |
| Memobase | 75.8% | — | — | — | Hierarchical memory |
| Zep | 75.1% | — | — | — | Hosted service + knowledge graph |
| soul.py (RLM) | 70.0% | 54.1% | 82.1% | 40.0% | Markdown + RLM routing |
| Mem0 | 66.9% | — | — | — | Vector DB memory |
| soul.py (Hybrid) | 65.6% | 46.0% | 79.5% | 29.8% | RAG + RLM auto-routing |
| soul.py (RAG) | 63.4% | 36.5% | 78.7% | 27.0% | Qdrant vector search |
| LangMem | 58.1% | — | — | — | LangChain memory |
| OpenAI | 52.9% | — | — | — | Built-in conversation memory |
Three things jump out:
1. The top tier is pulling away. Xmem at 91.5% is 16 points ahead of the next competitor. That’s not a marginal lead — that’s a generational gap.
2. RLM beats RAG in every column. soul.py’s RLM mode (using the language model itself to reason over memories) scores 70.0% vs RAG’s 63.4% overall. The widest gap is single-hop recall (54.1% vs 36.5%), followed by temporal reasoning (40.0% vs 27.0%). On temporal questions, the LLM can reason about time sequences in ways that vector similarity cannot.
3. OpenAI’s built-in memory is last. At 52.9%, the thing most people actually use for “memory” is the worst option. It’s a conversation buffer with a token limit, not a memory system.
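The margins in these three points can be recomputed straight from the scoreboard. The snippet below only does subtraction on the numbers in the table; the variable names are mine.

```python
# Scores copied from the scoreboard above (percent).
overall = {"Xmem": 91.5, "Memobase": 75.8, "Zep": 75.1}
rlm = {"overall": 70.0, "single_hop": 54.1, "multi_hop": 82.1, "temporal": 40.0}
rag = {"overall": 63.4, "single_hop": 36.5, "multi_hop": 78.7, "temporal": 27.0}

# Lead of the top system over the runner-up, in percentage points.
xmem_lead = round(overall["Xmem"] - overall["Memobase"], 1)

# RLM minus RAG for each column, in percentage points.
rlm_vs_rag = {col: round(rlm[col] - rag[col], 1) for col in rlm}

print(xmem_lead)   # 15.7
print(rlm_vs_rag)  # {'overall': 6.6, 'single_hop': 17.6, 'multi_hop': 3.4, 'temporal': 13.0}
```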
The Outlier: gbrain’s Retrieval Architecture
While benchmarking, I came across gbrain — Garry Tan’s personal knowledge management system. It wasn’t built for the same use case as the frameworks above, but its retrieval architecture is the most thoughtful I’ve seen. And it explains why the top-scoring systems win.
gbrain layers four retrieval strategies:
Layer 1: Vector Search (HNSW on pgvector)
Semantic similarity. “Who works on retrieval quality?” matches pages mentioning related concepts even without exact keyword overlap. Standard RAG stuff.
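To make the idea concrete, here is a brute-force sketch of the ranking that an HNSW index approximates at scale. This is not gbrain's code: the page names and the 3-dimensional vectors are invented for illustration, and real embeddings would come from a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=2):
    """Exact nearest neighbours by cosine similarity (no ANN index)."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [page for page, _ in ranked[:k]]

# Toy "embeddings", invented for illustration.
index = {
    "search-ranking-notes": [0.9, 0.1, 0.0],
    "team-offsite-plan": [0.0, 0.2, 0.9],
    "reranker-evaluation": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, index))  # ['search-ranking-notes', 'reranker-evaluation']
```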
Layer 2: BM25 Keyword Search
Lexical matching. Catches names, code identifiers, exact phrases. The cases where vector search drifts into “thematically adjacent but wrong” territory.
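For reference, a minimal Okapi BM25 scorer looks roughly like this. It is a sketch under simple assumptions (whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant), not the implementation gbrain uses; the documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "bob invested in acme ai last spring",
    "notes on venture investing and ai startups",
    "acme ai shipped a new retrieval model",
]
scores = bm25_scores("bob acme", docs)
best = docs[scores.index(max(scores))]
print(best)  # bob invested in acme ai last spring
```

Note that the second document, which is thematically close but shares no query term, scores exactly zero. That is the behavior that complements vector search.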
Layer 3: Reciprocal-Rank Fusion (RRF)
Merges vector and keyword rankings without globally weighting one over the other. Each strategy votes. No tuning required.
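RRF itself is only a few lines. In this sketch each ranked list contributes `1 / (k + rank)` to a document's score; `k=60` is the conventional default constant, and the page IDs are made up.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["page-a", "page-b", "page-c"]
keyword_hits = ["page-c", "page-a", "page-d"]
fused = rrf([vector_hits, keyword_hits])
print(fused)  # ['page-a', 'page-c', 'page-b', 'page-d']
```

Because only ranks are used, the raw scores from the two retrievers never need to be put on a common scale. Pages that appear in both lists rise above pages that appear in only one.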
Layer 4: Knowledge Graph Traversal
This is the load-bearing wall. The graph follows typed edges between entities:
Bob ── invested_in ──▶ Acme AI ── founded_in ──▶ 2024
“What did Bob invest in this quarter?” is a graph query. No amount of vector embedding tuning can answer it. The graph can.
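A typed-edge walk can be sketched with a small adjacency list. The `Graph` class and its `follow` method below are hypothetical names for illustration, not gbrain's API; the edges are the ones from the diagram above.

```python
from collections import defaultdict

class Graph:
    """Tiny directed graph with typed edges (illustrative only)."""

    def __init__(self):
        self.edges = defaultdict(list)  # source -> [(relation, target)]

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def follow(self, source, *relations):
        """Walk a chain of typed edges starting from `source`."""
        frontier = [source]
        for relation in relations:
            frontier = [target
                        for node in frontier
                        for rel, target in self.edges.get(node, [])
                        if rel == relation]
        return frontier

g = Graph()
g.add("Bob", "invested_in", "Acme AI")
g.add("Acme AI", "founded_in", "2024")

print(g.follow("Bob", "invested_in"))                # ['Acme AI']
print(g.follow("Bob", "invested_in", "founded_in"))  # ['2024']
```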
gbrain’s BrainBench Results
| Strategy | Precision@5 | Recall@5 |
|---|---|---|
| BM25 only | ~18% | ~75% |
| Vector only | ~18% | ~80% |
| Hybrid + RRF (no graph) | ~18% | ~85% |
| Full stack (with graph) | 49.1% | 97.9% |
The graph alone adds +31 precision points. Without it, all three other strategies plateau at ~18% precision. That’s not a feature — it’s the architecture.
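For readers less familiar with the two metrics, this is the standard definition in code. The example also shows one way precision@5 can stay low while recall@5 is high: a query with a single relevant page cannot exceed 20% precision@5 even when that page is found. Whether BrainBench's queries look like this is not stated in the source, so treat the data as a made-up illustration.

```python
def precision_recall_at_k(retrieved, relevant, k=5):
    """Precision@k and recall@k for one query."""
    top = retrieved[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query with exactly one relevant page.
retrieved = ["p1", "p7", "p3", "p9", "p2"]
relevant = {"p1"}
print(precision_recall_at_k(retrieved, relevant))  # (0.2, 1.0)
```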
The Insight: Why Graphs Win at Memory
Here’s the pattern across the leaderboard:
- Xmem (91.5%) — Extracts structured entities and relationships
- Zep (75.1%) — Built-in knowledge graph
- gbrain — Knowledge graph is the “load-bearing wall” (+31 P@5)
- soul.py RLM (70.0%) — No graph, but the LLM itself does implicit relationship reasoning
- Mem0 (66.9%), LangMem (58.1%), OpenAI (52.9%) — No graph, no structured extraction, all below the graph-backed systems
The systems that understand relationships between things outperform the systems that just find similar things.
Vector search answers: “Find me content that sounds like this query.”
Knowledge graphs answer: “Walk from entity A through relationship R to entity B.”
These are fundamentally different operations. The first is pattern matching. The second is reasoning.
Where Each Framework Fits
Not every project needs a knowledge graph. Here’s the honest breakdown:
Use soul.py when:
- You want zero infrastructure (markdown files, no database)
- You need provider-agnostic memory (OpenAI, Anthropic, Gemini)
- Your agent has a persistent identity (SOUL.md + MEMORY.md pattern)
- You’re building a side project or MVP that needs memory today
# Shell: install with the Anthropic extra (quoted so zsh does not expand the brackets)
pip install "soul-agent[anthropic]"
# Python: create an agent and store a fact
from hybrid_agent import HybridAgent
agent = HybridAgent()
agent.ask("Remember: I'm allergic to penicillin")
# ... later, new session ...
agent.ask("What medications should I avoid?")
# → "You mentioned you're allergic to penicillin"
Use Xmem when:
- Accuracy is everything (91.5% — nothing else comes close)
- You can afford the extraction overhead
- Your use case requires multi-hop reasoning
Use Zep when:
- You want a hosted solution with built-in knowledge graph
- Enterprise support matters
- You need relationship-aware memory out of the box
Use gbrain’s approach when:
- You’re building a personal knowledge base (not conversational memory)
- You have thousands of documents with entity relationships
- You want the best retrieval precision and can run pgvector
Avoid OpenAI’s built-in memory when:
- You need anything beyond “remember the last 20 messages”
- Temporal reasoning matters at all
- You’re building for production
The Gap: Temporal Reasoning
Temporal is the weakest column in every per-category breakdown on the scoreboard. soul.py’s best temporal score is 40.0% (RLM mode), against 82.1% on multi-hop in the same mode. Xmem likely handles this better through structured extraction, but temporal reasoning remains the frontier.
Why is it so hard? Consider: “When did the user change their mind about moving to Austin?”
To answer this, a memory system needs to:
1. Find the first mention of moving to Austin (positive sentiment)
2. Find the later mention (negative sentiment or cancellation)
3. Identify the transition point between them
4. Return the timestamp of that transition
Vector search returns both mentions. It can’t sequence them. BM25 finds the keyword “Austin” everywhere. Neither can identify the change.
A knowledge graph with temporal edges could model this:
User ── planned_move ──▶ Austin [date: March 2026]
User ── cancelled_move ──▶ Austin [date: April 2026]
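Here is a minimal sketch of how dated edges make the "change of mind" question answerable. `TemporalEdge` and `find_reversal` are hypothetical names of my own, and dates are kept at month granularity as `(year, month)` tuples because that is all the example above specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemporalEdge:
    subject: str
    relation: str
    target: str
    when: tuple  # (year, month)

def find_reversal(edges, subject, target,
                  start="planned_move", end="cancelled_move"):
    """Return the date of the first `end` edge that follows a `start` edge."""
    timeline = sorted(
        (e for e in edges if e.subject == subject and e.target == target),
        key=lambda e: e.when,
    )
    planned = False
    for edge in timeline:
        if edge.relation == start:
            planned = True
        elif edge.relation == end and planned:
            return edge.when
    return None

# Stored out of order on purpose: sorting by date recovers the sequence.
edges = [
    TemporalEdge("User", "cancelled_move", "Austin", (2026, 4)),
    TemporalEdge("User", "planned_move", "Austin", (2026, 3)),
]
print(find_reversal(edges, "User", "Austin"))  # (2026, 4)
```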
This is where the next wave of memory frameworks will compete.
What’s Next for soul.py
Based on this analysis, three improvements would move soul.py from 70% toward the 80%+ tier:
- Lightweight entity graph — Regex-extracted from MEMORY.md on every write (gbrain’s approach: zero LLM cost, three regexes, grows on every save)
- Graph traversal for multi-hop queries — Walk typed edges for relationship questions instead of relying on semantic similarity
- Cross-encoder reranking — A 150ms post-retrieval pass that reshuffles results with full query-document attention
The beauty of soul.py’s architecture is that none of these require abandoning the markdown-native philosophy. The graph can be extracted from the same MEMORY.md files. No pgvector. No hosted service. Just smarter retrieval over the same simple files.
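As a rough sketch of the first idea, regex extraction over markdown memory lines could look like the following. The three patterns and the sample lines are hypothetical: gbrain's actual regexes are not shown here, and soul.py's MEMORY.md format may differ.

```python
import re

# One or more capitalised words, e.g. "Bob" or "Acme AI" (assumed entity shape).
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hypothetical patterns; each maps a phrase to a typed edge.
EDGE_PATTERNS = [
    (re.compile(rf"({NAME}) invested in ({NAME})"), "invested_in"),
    (re.compile(rf"({NAME}) works at ({NAME})"), "works_at"),
    (re.compile(rf"({NAME}) was founded in (\d{{4}})"), "founded_in"),
]

def extract_edges(markdown):
    """Return (source, relation, target) triples found in the text."""
    edges = []
    for pattern, relation in EDGE_PATTERNS:
        for source, target in pattern.findall(markdown):
            edges.append((source, relation, target))
    return edges

memory_md = """\
- Bob invested in Acme AI
- Alice works at Acme AI
- Acme AI was founded in 2024
"""
for edge in extract_edges(memory_md):
    print(edge)
# ('Bob', 'invested_in', 'Acme AI')
# ('Alice', 'works_at', 'Acme AI')
# ('Acme AI', 'founded_in', '2024')
```

Running this on every write keeps the graph in sync with the file at no LLM cost. The trade-off is that it only captures relationships phrased the way the patterns expect.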
The Bottom Line
AI memory in 2026 is a solved problem at the architectural level — the top systems prove it works. It’s an unsolved problem at the practical level — most developers are still using OpenAI’s built-in memory (52.9%) or no memory at all.
The leaderboard tells a clear story: structure beats similarity. Systems that extract entities, build relationships, and traverse graphs outperform systems that embed everything into vectors and hope for the best.
The question isn’t whether your agent needs memory. It’s whether your memory system understands relationships or just recognizes vibes.
soul.py is open source (MIT). Benchmarks run on LoCoMo from Snap Research. Full results and methodology at menonpg.github.io/soul-benchmarks. gbrain retrieval architecture documented at github.com/garrytan/gbrain.