Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
Summary
RNG-Bench (Reconstructive Non-Markov Games) is a benchmark suite that evaluates how well Multimodal Large Language Models (MLLMs) reconstruct and act on past observations that are no longer visible during multi-step interaction. Existing benchmarks often conflate hidden-state reconstruction with other agent skills; RNG-Bench isolates it. The suite has two games: Matching Pairs, which tests recall of card identities, and 3D Maze, which tests integration of egocentric views into a spatial map. A unified harness varies difficulty along three controlled axes (grid size, visual pattern, and observation modality) and adds a head-to-head duel protocol and a Memory Gap metric. Frontier MLLMs do not saturate the hardest configurations, which require 128K tokens and 350 image inputs. Memory Gap analysis attributes most errors to forgetting earlier observations. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered demonstrations improves RNG-Bench performance and transfers to other benchmarks without degrading general multimodal capability.
Key takeaway
AI scientists and machine learning engineers evaluating MLLMs in non-Markovian environments should expect current models to struggle with reconstructing and acting on past observations. Use a benchmark such as RNG-Bench to diagnose hidden-state reconstruction limits specifically. Consider fine-tuning on optimal-policy rollouts and filtered demonstrations to strengthen long-horizon memory; in the reported experiments, this improved RNG-Bench performance and transferred to other benchmarks without degrading general multimodal capability.
Key insights
MLLMs struggle with reconstructing and acting on past, hidden observations in non-Markovian games.
Principles
- Existing MLLM benchmarks often conflate hidden-state reconstruction with other agent skills.
- Forgetting earlier observations is a primary source of MLLM errors in non-Markov games.
- Fine-tuning MLLMs on optimal-policy rollouts and filtered demonstrations can improve memory, and the gains transfer to other benchmarks.
Method
RNG-Bench evaluates MLLMs on two games, Matching Pairs and 3D Maze, with difficulty controlled along three axes: grid size, visual pattern, and observation modality. It also uses a head-to-head duel protocol and a Memory Gap metric.
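To make the non-Markov property concrete, here is a minimal sketch of a Matching Pairs game in Python. It is not the RNG-Bench harness: the class `MatchingPairs`, the two-flips-per-turn rule, and both baseline players are assumptions made for illustration. The point it shows is that the current observation hides every unmatched card, so an agent that acts only on that observation has nothing to go on except chance, while an agent that reconstructs the hidden state from earlier flips can plan.

```python
import random


class MatchingPairs:
    """Toy Matching Pairs game (hypothetical, not the RNG-Bench code)."""

    def __init__(self, rows, cols, seed=0):
        assert (rows * cols) % 2 == 0, "need an even number of cards"
        cards = list(range(rows * cols // 2)) * 2  # two cards per identity
        random.Random(seed).shuffle(cards)
        self.cards = cards
        self.matched = [False] * len(cards)

    def observe(self):
        # The current observation shows matched cards only; the rest are "?".
        return [c if m else "?" for c, m in zip(self.cards, self.matched)]

    def flip(self, i, j):
        """Flip two distinct unmatched cards and return their identities."""
        assert i != j and not self.matched[i] and not self.matched[j]
        if self.cards[i] == self.cards[j]:
            self.matched[i] = self.matched[j] = True
        return self.cards[i], self.cards[j]

    def done(self):
        return all(self.matched)


def play_memoryless(env, seed=0):
    """Acts on the current observation alone: flips two random hidden cards."""
    rng = random.Random(seed)
    turns = 0
    while not env.done():
        hidden = [p for p, v in enumerate(env.observe()) if v == "?"]
        env.flip(*rng.sample(hidden, 2))
        turns += 1
    return turns


def play_with_memory(env):
    """Remembers every revealed card and matches known pairs first."""
    seen = {}  # position -> identity revealed on an earlier turn
    turns = 0
    while not env.done():
        hidden = [p for p, m in enumerate(env.matched) if not m]
        first_pos, pair = {}, None
        for p in hidden:
            if p in seen:
                if seen[p] in first_pos:  # both positions of this identity known
                    pair = (first_pos[seen[p]], p)
                    break
                first_pos[seen[p]] = p
        if pair is None:
            # No known pair: explore cards that have not been revealed yet.
            unseen = [p for p in hidden if p not in seen]
            if len(unseen) >= 2:
                pair = (unseen[0], unseen[1])
            else:
                pair = (unseen[0], next(p for p in hidden if p != unseen[0]))
        a, b = env.flip(*pair)
        seen[pair[0]], seen[pair[1]] = a, b
        turns += 1
    return turns


if __name__ == "__main__":
    print("memoryless turns:", play_memoryless(MatchingPairs(4, 4, seed=1)))
    print("with-memory turns:", play_with_memory(MatchingPairs(4, 4, seed=1)))
```

In this toy version, increasing `rows` and `cols` plays the role of the grid-size axis: the larger the board, the more past flips an agent has to retain before it can act well. The visual-pattern and observation-modality axes are not modeled here.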
In practice
- Use RNG-Bench to specifically evaluate MLLM hidden-state reconstruction capabilities.
- Employ the Memory Gap metric to diagnose MLLM forgetting issues versus poor action selection.
- Consider fine-tuning MLLMs with optimal-policy rollouts to enhance long-term observational memory.
Topics
- Multimodal Large Language Models
- Non-Markov Games
- RNG-Bench
- Hidden State Reconstruction
- MLLM Benchmarking
- Model Fine-tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.