What is the best AI memory framework in 2026?

Based on LoCoMo benchmark results, Xmem leads at 91.5% overall accuracy, followed by Memobase (75.8%), Zep (75.1%), and soul.py (70.0%). However, 'best' depends on your use case — conversational memory, knowledge retrieval, and relationship reasoning each favor different architectures.

How does soul.py compare to Mem0 and Zep?

soul.py's RLM mode scores 70.0% on LoCoMo, outperforming Mem0 (66.9%) and trailing Zep (75.1%). soul.py's key advantage is zero infrastructure — it runs on markdown files with no database required. Zep requires a hosted service, and Mem0 needs a vector database.

What is gbrain and how does its retrieval work?

gbrain is Garry Tan's personal knowledge management system. It uses a 4-layer retrieval pipeline: vector search (pgvector), BM25 keyword matching, reciprocal-rank fusion, and knowledge graph traversal. The graph alone adds +31 precision points over hybrid search without it.

What is the LoCoMo benchmark for AI memory?

LoCoMo (Long Conversation Memory) is a benchmark from Snap Research containing 1,986 questions across 10 long conversations. It tests four capabilities: single-hop recall, multi-hop reasoning, open-domain knowledge, and temporal understanding.

Why do AI agents struggle with temporal memory?

Temporal questions like 'when did the user change jobs?' require understanding time sequences, not just semantic similarity. Vector search returns thematically similar content regardless of chronology. Knowledge graphs and structured memory solve this by encoding temporal relationships as typed edges.

What is the difference between RAG and RLM for agent memory?

RAG (Retrieval-Augmented Generation) searches a vector database for semantically similar content. RLM (Retrieval via Language Model) uses the LLM itself to reason over stored memories. In soul.py's LoCoMo results, RLM beats RAG by 6.6 points overall (70.0% vs 63.4%), with the largest gaps on single-hop recall (54.1% vs 36.5%) and temporal questions (40.0% vs 27.0%).

How does a knowledge graph improve AI memory retrieval?

Knowledge graphs encode relationships between entities (person → invested_in → company → founded_in → 2024). This lets the system answer multi-hop questions like 'What companies did Bob invest in this quarter?' that vector search cannot solve. gbrain's benchmarks show the graph adds +31 precision points.

Can I use soul.py with OpenAI, Anthropic, or Gemini?

Yes. soul.py is provider-agnostic and supports OpenAI, Anthropic Claude, and Google Gemini via pip extras: pip install soul-agent[openai], soul-agent[anthropic], or soul-agent[gemini].

AI Memory Benchmark Showdown: How soul.py, gbrain, Xmem, and 6 Others Actually Compare

By Prahlad Menon · Published 2026-05-19 · 1 min read

Your AI agent forgets everything the moment the conversation ends. You know this. I know this. Every framework claims to fix it. But which ones actually work?

I ran the numbers. Eight memory frameworks, one benchmark, and a deep dive into the most interesting retrieval architecture I’ve seen this year. Here’s what I found.

The Benchmark: LoCoMo

LoCoMo (Long Conversation Memory) from Snap Research is the standard benchmark for conversational AI memory. It throws 1,986 questions at your system across 10 long conversations, testing four capabilities:

Single-hop recall — “What’s the user’s dog’s name?”
Multi-hop reasoning — “What city does the user’s sister live in, and what restaurant did they mention there?”
Open-domain knowledge — Integrating memory with world knowledge
Temporal understanding — “When did the user change their mind about moving?”

Temporal is where frameworks go to die.

The Scoreboard

System	Overall	Single-hop	Multi-hop	Temporal	Architecture
Xmem	91.5%	—	—	—	Structured extraction + graph
Memobase	75.8%	—	—	—	Hierarchical memory
Zep	75.1%	—	—	—	Hosted service + knowledge graph
soul.py (RLM)	70.0%	54.1%	82.1%	40.0%	Markdown + RLM routing
Mem0	66.9%	—	—	—	Vector DB memory
soul.py (Hybrid)	65.6%	46.0%	79.5%	29.8%	RAG + RLM auto-routing
soul.py (RAG)	63.4%	36.5%	78.7%	27.0%	Qdrant vector search
LangMem	58.1%	—	—	—	LangChain memory
OpenAI	52.9%	—	—	—	Built-in conversation memory

Three things jump out:

1. The top tier is pulling away. Xmem at 91.5% is 15.7 points ahead of the next competitor, Memobase at 75.8%. That’s not a marginal lead; it’s a generational gap.

2. RLM beats RAG by a wide margin. soul.py’s RLM mode (using the language model itself to reason over memories) scores 70.0% vs RAG’s 63.4%. The biggest absolute gap is single-hop recall (54.1% vs 36.5%), with temporal reasoning close behind (40.0% vs 27.0%). The LLM can reason about time sequences in ways that vector similarity cannot.

3. OpenAI’s built-in memory is last. At 52.9%, the thing most people actually use for “memory” is the worst option. It’s a conversation buffer with a token limit, not a memory system.
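For anyone who wants to check the RLM vs RAG gaps directly, here is a small sketch that recomputes them from the scoreboard rows. The numbers are copied from the table above; the dictionary layout and variable names are just for illustration.

```python
# Per-category scores for soul.py's RLM and RAG modes, copied from the scoreboard.
scores = {
    "overall":    {"rlm": 70.0, "rag": 63.4},
    "single-hop": {"rlm": 54.1, "rag": 36.5},
    "multi-hop":  {"rlm": 82.1, "rag": 78.7},
    "temporal":   {"rlm": 40.0, "rag": 27.0},
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {cat: round(s["rlm"] - s["rag"], 1) for cat, s in scores.items()}

# Print categories from largest gap to smallest.
for cat, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:<11} +{gap:.1f} points")
```

Multi-hop is where the two modes are closest; single-hop and temporal are where RLM pulls ahead.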

The Outlier: gbrain’s Retrieval Architecture

While benchmarking, I came across gbrain — Garry Tan’s personal knowledge management system. It wasn’t built for the same use case as the frameworks above, but its retrieval architecture is the most thoughtful I’ve seen. And it explains why the top-scoring systems win.

gbrain layers four retrieval strategies:

Layer 1: Vector Search (HNSW on pgvector)

Semantic similarity. “Who works on retrieval quality?” matches pages mentioning related concepts even without exact keyword overlap. Standard RAG stuff.
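A minimal sketch of what this layer does, assuming brute-force cosine similarity over hand-made 3-dimensional vectors. gbrain uses an HNSW index on pgvector and real embeddings; the page names, vectors, and function names below are hypothetical and only show the ranking idea.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, index, top_k=2):
    """Score every page against the query and return the top_k (page, score) pairs."""
    scored = [(page, cosine_similarity(query_vec, vec)) for page, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (assumption: real ones come from an embedding model).
index = {
    "search-ranking-notes": [0.9, 0.1, 0.0],
    "hiring-plan":          [0.1, 0.9, 0.1],
    "eval-metrics":         [0.7, 0.2, 0.3],
}
query = [0.8, 0.1, 0.2]  # stand-in for "Who works on retrieval quality?"

results = vector_search(query, index)
for page, score in results:
    print(f"{page}: {score:.3f}")
```

Nothing here looks at keywords: a page ranks high purely because its vector points in a similar direction to the query's.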

Layer 2: BM25 Keyword Search

Lexical matching. Catches names, code identifiers, exact phrases. The cases where vector search drifts into “thematically adjacent but wrong” territory.
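For reference, here is a compact sketch of Okapi BM25 scoring with the commonly used defaults k1=1.5 and b=0.75 and a non-negative IDF variant. It is not gbrain's implementation; the whitespace tokenizer, the sample pages, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: real systems do more)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return {doc_id: BM25 score} for a query over a small in-memory corpus."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "a": "acme ai raised a seed round in 2024",
    "b": "notes on retrieval quality and ranking",
    "c": "bob met the acme ai founders",
}
scores = bm25_scores("acme founders", docs)
print(sorted(scores, key=scores.get, reverse=True))
```

Page "c" wins because it contains both query terms, and the rarer term ("founders") contributes more than the common one ("acme"). Page "b" scores zero: no lexical overlap, no match.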

Layer 3: Reciprocal-Rank Fusion (RRF)

Merges vector and keyword rankings without globally weighting one over the other. Each strategy votes. No tuning required.
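Here is a minimal sketch of reciprocal-rank fusion: each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in. The constant k=60 is the conventional default for RRF; whether gbrain uses that value is an assumption, and the page IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A higher position (smaller rank) contributes a larger vote.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

vector_ranking = ["page-a", "page-b", "page-c"]
bm25_ranking = ["page-b", "page-d", "page-a"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)  # → ['page-b', 'page-a', 'page-d', 'page-c']
```

Only ranks enter the formula, never raw scores, so cosine similarities and BM25 scores never need to be put on a common scale. Pages that both strategies like rise to the top.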

Layer 4: Knowledge Graph Traversal

This is the load-bearing wall. The graph follows typed edges between entities:

Bob ── invested_in ──▶ Acme AI ── founded_in ──▶ 2024

“What did Bob invest in this quarter?” is a graph query. No amount of vector embedding tuning can answer it. The graph can.
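A rough sketch of typed-edge traversal, using the two edges from the diagram above. The `TypedGraph` class and its `follow` method are hypothetical names for illustration, not gbrain's API.

```python
from collections import defaultdict

class TypedGraph:
    """Tiny in-memory graph keyed by (source, relation)."""

    def __init__(self):
        self._edges = defaultdict(list)  # (source, relation) -> [targets]

    def add_edge(self, source, relation, target):
        self._edges[(source, relation)].append(target)

    def follow(self, start, *relations):
        """Walk from start along each relation in order; return the end nodes."""
        frontier = [start]
        for relation in relations:
            frontier = [
                target
                for node in frontier
                for target in self._edges.get((node, relation), [])
            ]
        return frontier

graph = TypedGraph()
graph.add_edge("Bob", "invested_in", "Acme AI")
graph.add_edge("Acme AI", "founded_in", "2024")

print(graph.follow("Bob", "invested_in"))                # → ['Acme AI']
print(graph.follow("Bob", "invested_in", "founded_in"))  # → ['2024']
```

Each hop is an exact lookup on a relation type, which is why a multi-hop question becomes a short chain of lookups instead of a similarity guess.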

gbrain’s BrainBench Results

Strategy	Precision@5	Recall@5
BM25 only	~18%	~75%
Vector only	~18%	~80%
Hybrid + RRF (no graph)	~18%	~85%
Full stack (with graph)	49.1%	97.9%

The graph alone adds about 31 precision points: 49.1% with it, versus ~18% for each of the three strategies without it. That’s not a feature; it’s the architecture.
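To read the table, it helps to pin down the two metrics. The sketch below uses the standard definitions of Precision@k and Recall@k; whether BrainBench computes them exactly this way is an assumption, and the document IDs are invented.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# One relevant page, found at rank 1, followed by four distractors.
retrieved = ["a", "x", "y", "z", "w"]
relevant = {"a"}

print(precision_at_k(retrieved, relevant))  # → 0.2
print(recall_at_k(retrieved, relevant))     # → 1.0
```

This is how high recall and low precision can coexist: a strategy can surface the right page every time and still fill most of its top five with noise.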

The Insight: Why Graphs Win at Memory

Here’s the pattern across the leaderboard:

Xmem (91.5%) — Extracts structured entities and relationships
Zep (75.1%) — Built-in knowledge graph
gbrain — Knowledge graph is the “load-bearing wall” (+31 P@5)
soul.py RLM (70.0%) — No graph, but the LLM itself does implicit relationship reasoning
Mem0, LangMem, OpenAI — No graph, no structured extraction, lowest scores

The systems that understand relationships between things outperform the systems that just find similar things.

Vector search answers: “Find me content that sounds like this query.”

Knowledge graphs answer: “Walk from entity A through relationship R to entity B.”

These are fundamentally different operations. The first is pattern matching. The second is reasoning.

Where Each Framework Fits

Not every project needs a knowledge graph. Here’s the honest breakdown:

Use soul.py when:

You want zero infrastructure (markdown files, no database)
You need provider-agnostic memory (OpenAI, Anthropic, Gemini)
Your agent has a persistent identity (SOUL.md + MEMORY.md pattern)
You’re building a side project or MVP that needs memory today

pip install "soul-agent[anthropic]"

from hybrid_agent import HybridAgent
agent = HybridAgent()
agent.ask("Remember: I'm allergic to penicillin")
# ... later, new session ...
agent.ask("What medications should I avoid?")
# → "You mentioned you're allergic to penicillin"

Use Xmem when:

Accuracy is everything (91.5% — nothing else comes close)
You can afford the extraction overhead
Your use case requires multi-hop reasoning

Use Zep when:

You want a hosted solution with built-in knowledge graph
Enterprise support matters
You need relationship-aware memory out of the box

Use gbrain’s approach when:

You’re building a personal knowledge base (not conversational memory)
You have thousands of documents with entity relationships
You want the best retrieval precision and can run pgvector

Avoid OpenAI’s built-in memory when:

You need anything beyond “remember the last 20 messages”
Temporal reasoning matters at all
You’re building for production

The Gap: Temporal Reasoning

Temporal questions are the weakest category in the per-category scores reported here: soul.py’s best temporal score is 40% (RLM mode), against 82.1% on multi-hop. Xmem likely handles this better through structured extraction, but temporal reasoning remains the frontier.

Why is it so hard? Consider: “When did the user change their mind about moving to Austin?”

To answer this, a memory system needs to:

1. Find the first mention of moving to Austin (positive sentiment)
2. Find the later mention (negative sentiment or cancellation)
3. Identify the transition point between them
4. Return the timestamp of that transition

Vector search returns both mentions. It can’t sequence them. BM25 finds the keyword “Austin” everywhere. Neither can identify the change.

A knowledge graph with temporal edges could model this:

User ── planned_move ──▶ Austin [date: March 2026]
User ── cancelled_move ──▶ Austin [date: April 2026]
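The two edges above can be sketched as dated records plus a query that looks for the first "after" relation following a "before" relation. Everything here is illustrative: the `TemporalEdge` type and `find_transition` function are hypothetical, and since the edges only carry a month, the day is set to the 1st as a placeholder.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TemporalEdge:
    subject: str
    relation: str
    obj: str
    when: date

# Deliberately stored out of order; the query sorts by date.
edges = [
    TemporalEdge("User", "cancelled_move", "Austin", date(2026, 4, 1)),
    TemporalEdge("User", "planned_move", "Austin", date(2026, 3, 1)),
]

def find_transition(edges, subject, obj, before, after):
    """Return the date of the first `after` edge that follows a `before` edge."""
    timeline = sorted(
        (e for e in edges if e.subject == subject and e.obj == obj),
        key=lambda e: e.when,
    )
    seen_before = False
    for edge in timeline:
        if edge.relation == before:
            seen_before = True
        elif edge.relation == after and seen_before:
            return edge.when
    return None

changed = find_transition(edges, "User", "Austin", "planned_move", "cancelled_move")
print(changed)  # → 2026-04-01
```

The ordering comes from the stored dates, not from anything in the text itself, which is the piece similarity search is missing.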

This is where the next wave of memory frameworks will compete.

What’s Next for soul.py

Based on this analysis, three improvements would move soul.py from 70% toward the 80%+ tier:

Lightweight entity graph — Regex-extracted from MEMORY.md on every write (gbrain’s approach: zero LLM cost, three regexes, grows on every save)
Graph traversal for multi-hop queries — Walk typed edges for relationship questions instead of relying on semantic similarity
Cross-encoder reranking — A 150ms post-retrieval pass that reshuffles results with full query-document attention
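As a rough idea of what the first item could look like, here is a sketch that pulls typed edges out of markdown with three regexes and no LLM calls. The patterns, the relation names, and the sample MEMORY.md content are all assumptions made for illustration; they are not gbrain's actual regexes or soul.py's file format.

```python
import re

# A "name" here is one or more capitalized words, e.g. "Bob" or "Acme AI".
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hypothetical relation patterns; a real system would tune these to its notes.
PATTERNS = {
    "invested_in": re.compile(rf"(?P<src>{NAME}) invested in (?P<dst>{NAME})"),
    "works_at": re.compile(rf"(?P<src>{NAME}) works at (?P<dst>{NAME})"),
    "founded": re.compile(rf"(?P<src>{NAME}) founded (?P<dst>{NAME})"),
}

def extract_edges(markdown):
    """Return a set of (source, relation, target) triples found in the text."""
    edges = set()
    for line in markdown.splitlines():
        for relation, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                edges.add((match.group("src"), relation, match.group("dst")))
    return edges

memory_md = """# MEMORY.md
- Bob invested in Acme AI last spring.
- Alice works at Acme AI.
- The user prefers tea.
"""

edges = extract_edges(memory_md)
for edge in sorted(edges):
    print(edge)
```

Running this on every save would let the edge set grow with the file, at the cost of missing any relationship that is not phrased the way the patterns expect.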

The beauty of soul.py’s architecture is that none of these require abandoning the markdown-native philosophy. The graph can be extracted from the same MEMORY.md files. No pgvector. No hosted service. Just smarter retrieval over the same simple files.

The Bottom Line

AI memory in 2026 is a solved problem at the architectural level: the top systems prove it works. It’s an unsolved problem at the practical level: most developers are still using OpenAI’s built-in memory (which scores 52.9% on LoCoMo) or no memory at all.

The leaderboard tells a clear story: structure beats similarity. Systems that extract entities, build relationships, and traverse graphs outperform systems that embed everything into vectors and hope for the best.

The question isn’t whether your agent needs memory. It’s whether your memory system understands relationships or just recognizes vibes.

soul.py is open source (MIT). Benchmarks run on LoCoMo from Snap Research. Full results and methodology at menonpg.github.io/soul-benchmarks. gbrain retrieval architecture documented at github.com/garrytan/gbrain.