#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

2026-05-17 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

An experimental memory retrieval system achieved a 96.4% score at top-50 on the LongMemEval benchmark using Gemini 3 Flash, outperforming reported scores of systems like Mem0 (94.8%), Honcho (92.6%), HydraDB (90.79%), and Supermemory (85.2%), all of which used Gemini 3 Pro. The system's architecture is informed by episodic memory theory, reconstructive recall, and temporal context models. Key design choices include query decomposition for parallel retrieval, temporal salience scoring combining semantic similarity, lexical precision, and recency, and coherence re-ranking. The evaluation used a forked Mem0 benchmarking script with a single generic prompt across 500 questions, deliberately employing a smaller answering model to isolate retrieval quality. Category results ranged from 94.0% for multi-session to 100% for assistant queries.

Key takeaway

NLP engineers building conversational AI should consider grounding their retrieval architecture in cognitive science principles. The reported gains from query decomposition, temporal salience scoring, and coherence re-ranking suggest these methods can significantly improve memory recall, even with a smaller answering model. At accuracy this high, also watch for evaluation ceiling effects and benchmark inconsistencies.

Key insights

Retrieval architectures informed by cognitive science can significantly improve conversational memory performance.

Principles

Isolate retrieval quality from model capability.
Decompose queries for multi-session contexts.
Score candidates on temporal salience.
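
As a rough illustration of the third principle, the sketch below combines the three signals named in the summary (semantic similarity, lexical precision, and recency) into one score. The weighted-sum form, the weights, the 30-day half-life, and all names (`Candidate`, `recency`, `temporal_salience`, `top_k`) are assumptions made for illustration; the source does not describe the actual formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic: float  # semantic similarity in [0, 1], e.g. embedding cosine (assumed input)
    lexical: float   # lexical match score in [0, 1], e.g. keyword overlap (assumed input)
    age_days: float  # time elapsed since the memory was stored


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life (assumed form)."""
    return 0.5 ** (age_days / half_life_days)


def temporal_salience(c: Candidate, w_sem: float = 0.6,
                      w_lex: float = 0.25, w_rec: float = 0.15) -> float:
    """Weighted sum of the three signals; the weights are illustrative, not from the source."""
    return w_sem * c.semantic + w_lex * c.lexical + w_rec * recency(c.age_days)


def top_k(candidates: list[Candidate], k: int = 50) -> list[Candidate]:
    """Keep the k highest-scoring candidates (the benchmark result was reported at top-50)."""
    return sorted(candidates, key=temporal_salience, reverse=True)[:k]
```

With these weights, a recent memory can outrank an older one that is only slightly more similar, which is the behavior a recency term is meant to provide.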

Method

The system uses query decomposition, temporal salience scoring, and coherence re-ranking, drawing on episodic memory theory and reconstructive recall, to improve memory retrieval for conversational AI.
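
The decomposition step can be sketched as follows: split a compound question into sub-queries, retrieve for each one in parallel, then merge the hits. This is a toy under stated assumptions. The split on "and" stands in for whatever decomposer the system actually uses, the keyword-overlap retriever stands in for a real memory store, and `MEMORIES`, `decompose`, `retrieve`, and `retrieve_decomposed` are hypothetical names.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory store standing in for a real memory backend.
MEMORIES = [
    "User traveled to Lisbon in March.",
    "User cooked ramen last week.",
    "User adopted a cat named Miso.",
]


def decompose(query: str) -> list[str]:
    """Naive placeholder: split on the word 'and'. A real system might use an LLM here."""
    parts = [p.strip(" ?.") for p in re.split(r"\band\b", query)]
    return [p for p in parts if p]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(sub_query: str, k: int = 2) -> list[str]:
    """Rank memories by shared-token count with the sub-query; drop non-matches."""
    q = tokens(sub_query)
    scored = [(len(q & tokens(m)), m) for m in MEMORIES]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [m for _, m in scored[:k]]


def retrieve_decomposed(query: str, k: int = 2) -> list[str]:
    """Run one retrieval per sub-query in parallel, then merge without duplicates."""
    subs = decompose(query)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: retrieve(s, k), subs))
    merged: list[str] = []
    for hits in results:
        for m in hits:
            if m not in merged:
                merged.append(m)
    return merged
```

Calling `retrieve_decomposed("Where did I travel in March and what did I cook last week?")` issues two independent lookups, so each half of the question can surface its own memory even when the two live in different sessions.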

In practice

Implement query decomposition for complex queries.
Incorporate temporal factors in retrieval scoring.
Re-rank results for cross-memory coherence.
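
One possible reading of the re-ranking step is sketched below: blend each candidate's base retrieval score with its average similarity to the other candidates, so that memories supporting one another rise together. The source does not say how coherence is measured. Jaccard token overlap, the blend weight `alpha`, and the names `jaccard` and `coherence_rerank` are all assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a crude stand-in for a real coherence measure."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def coherence_rerank(scored: list[tuple[str, float]],
                     alpha: float = 0.3) -> list[tuple[str, float]]:
    """Blend each (text, base_score) pair with its mean similarity to the other candidates."""
    reranked = []
    for i, (text, base) in enumerate(scored):
        others = [t for j, (t, _) in enumerate(scored) if j != i]
        coherence = sum(jaccard(text, o) for o in others) / len(others) if others else 0.0
        reranked.append((text, (1 - alpha) * base + alpha * coherence))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("User booked flights to Lisbon", 0.70),
    ("User dislikes cilantro", 0.62),
    ("User booked a hotel in Lisbon", 0.60),
]
for text, score in coherence_rerank(candidates):
    print(f"{score:.3f}  {text}")
```

In this toy input the hotel memory starts below the unrelated cilantro memory but overtakes it after re-ranking, because it shares content with the top-scoring flight memory.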

Topics

LongMemEval Benchmark
Gemini Flash
Memory Retrieval Systems
Episodic Memory Theory
Query Decomposition

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.