MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital - Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

MEMLENS is a new benchmark designed to systematically compare multimodal long-term memory capabilities in Large Vision-Language Models (LVLMs) and memory-augmented agents. It features 789 questions across five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal. The benchmark evaluates models at four context lengths (32K-256K tokens) using a cross-modal token-counting scheme. An image-ablation study confirmed that 80.4% of MEMLENS questions require visual evidence, with two frontier LVLMs dropping below 2% accuracy without images. Evaluation of 27 LVLMs and 7 memory-augmented agents revealed that long-context LVLMs excel in short-context visual grounding but degrade with longer conversations, while memory agents maintain length stability but lose visual fidelity due to compression. Multi-session reasoning proved challenging, with most systems scoring below 30%.

Key takeaway

For AI engineers building multimodal systems, MEMLENS points toward hybrid architectures that combine long-context attention with structured multimodal retrieval. Two problems need attention: long-context LVLMs degrade as conversations grow longer, and memory-augmented agents lose visual fidelity when they compress what they store. Robust long-term memory requires addressing both.

Key insights

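MEMLENS benchmarks multimodal long-term memory and finds complementary weaknesses: long-context LVLMs ground visual details well in short contexts but degrade as conversations lengthen, while memory-augmented agents stay stable across lengths but lose visual fidelity to compression.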

Method

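MEMLENS evaluates five memory abilities at four context lengths (32K-256K tokens), with length measured by a cross-modal token-counting scheme. A separate image-ablation study checks that the questions depend on visual evidence: 80.4% of them require it.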
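
The evaluation grid described above can be sketched in a few lines. This is an illustrative sketch only, not the benchmark's actual code: the function names, the record fields, and the fixed per-image cost (`tokens_per_image`) are assumptions, and whitespace splitting stands in for a real tokenizer. The summary does not say how MEMLENS counts image tokens.

```python
from collections import defaultdict

# The five memory abilities named in the summary.
ABILITIES = (
    "information_extraction",
    "multi_session_reasoning",
    "temporal_reasoning",
    "knowledge_update",
    "answer_refusal",
)


def count_cross_modal_tokens(turns, tokens_per_image=256):
    """Approximate the length of a conversation in tokens.

    Assumption: text is counted by whitespace splitting and every image
    costs a fixed, hypothetical number of tokens.
    """
    total = 0
    for turn in turns:
        total += len(turn.get("text", "").split())
        total += tokens_per_image * len(turn.get("images", []))
    return total


def accuracy_grid(results):
    """Aggregate accuracy per (ability, context_length) cell.

    Each record is a dict with hypothetical keys:
    "ability", "context_length", and "correct" (bool).
    """
    tally = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_total]
    for record in results:
        if record["ability"] not in ABILITIES:
            raise ValueError(f"unknown ability: {record['ability']}")
        cell = tally[(record["ability"], record["context_length"])]
        cell[0] += int(record["correct"])
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in tally.items()}
```

Reporting one accuracy per cell rather than a single average keeps the two failure modes visible separately: degradation along the context-length axis and weakness on a specific ability such as multi-session reasoning.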

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.