Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

RNG-Bench (Reconstructive Non-Markov Games) is a new benchmark suite that evaluates how well Multimodal Large Language Models (MLLMs) reconstruct and act on past, hidden observations during multi-step interactions. It addresses a limitation of existing benchmarks by isolating hidden-state reconstruction from other agent skills. The suite includes two games: Matching Pairs, which tests recall of card identities, and 3D Maze, which tests integration of egocentric views into a spatial map. Evaluation uses a unified harness with three controlled difficulty axes (grid size, visual pattern, and observation modality), alongside a head-to-head duel protocol and a Memory Gap metric. The hardest configurations, which require 128K tokens and 350 image inputs, are not saturated by frontier MLLMs. Memory Gap analysis shows that errors stem primarily from forgetting earlier observations. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered demonstrations improves RNG-Bench performance and transfers to other benchmarks without degrading general multimodal capability.

Key takeaway

AI scientists and machine learning engineers evaluating MLLMs in complex, non-Markovian environments should recognize that current models often struggle to reconstruct and act on past observations. Consider using benchmarks like RNG-Bench to diagnose hidden-state reconstruction limitations specifically. Fine-tuning on optimal-policy rollouts and filtered demonstrations can strengthen a model's long-term memory; in the source study, it improved RNG-Bench performance and transferred to other benchmarks.

Key insights

MLLMs struggle with reconstructing and acting on past, hidden observations in non-Markovian games.

Principles

Existing MLLM benchmarks often conflate hidden-state reconstruction with other agent skills.
Forgetting earlier observations is a primary source of MLLM errors in non-Markov games.
Fine-tuning MLLMs on optimal-policy rollouts can improve memory and transferability.

Method

RNG-Bench evaluates MLLMs on two games, Matching Pairs and 3D Maze, with difficulty controlled along three axes: grid size, visual pattern, and observation modality. It employs a head-to-head duel protocol and a Memory Gap metric.
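To make the non-Markov structure concrete, here is a minimal sketch of a Matching Pairs game played by a perfect-memory policy. It is an illustration under stated assumptions, not the RNG-Bench harness: the function names (`make_board`, `observe`, `perfect_memory_rollout`), the integer card identities, and the turn format are all hypothetical. The point is that `observe` exposes only the cards currently face up, so choosing well requires the `seen` memory of earlier flips.

```python
import random


def make_board(num_pairs, seed=0):
    """Return a shuffled list of card identities, two of each (hypothetical layout)."""
    rng = random.Random(seed)
    cards = [i for i in range(num_pairs) for _ in range(2)]
    rng.shuffle(cards)
    return cards


def observe(board, face_up):
    """Current observation: identities of face-up cards only, None elsewhere."""
    return [c if i in face_up else None for i, c in enumerate(board)]


def find_known_pair(seen):
    """Return two remembered positions holding the same identity, if any."""
    by_identity = {}
    for pos, ident in seen.items():
        if ident in by_identity:
            return by_identity[ident], pos
        by_identity[ident] = pos
    return None


def perfect_memory_rollout(board):
    """Play to completion with perfect recall; return the list of (first, second) turns."""
    seen = {}        # position -> identity for revealed, still-unmatched cards
    matched = set()  # positions already removed from play
    turns = []
    n = len(board)

    def next_unseen():
        return next(p for p in range(n) if p not in matched and p not in seen)

    while len(matched) < n:
        pair = find_known_pair(seen)
        if pair is not None:
            # Memory already holds a complete pair: take it.
            first, second = pair
        else:
            # Otherwise reveal a new card, then use memory for its partner if known.
            first = next_unseen()
            seen[first] = board[first]
            second = next(
                (p for p, c in seen.items() if p != first and c == board[first]),
                None,
            )
            if second is None:
                second = next_unseen()
                seen[second] = board[second]
        turns.append((first, second))
        if board[first] == board[second]:
            matched.update((first, second))
            seen.pop(first, None)
            seen.pop(second, None)
    return turns


board = make_board(num_pairs=8, seed=1)
turns = perfect_memory_rollout(board)
matches = sum(board[a] == board[b] for a, b in turns)
```

Rollouts like `turns` are one plausible shape for the optimal-policy demonstrations mentioned above; how the paper actually formats or filters its demonstrations is not stated in this summary.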

In practice

Use RNG-Bench to specifically evaluate MLLM hidden-state reconstruction capabilities.
Employ the Memory Gap metric to distinguish errors caused by forgetting earlier observations from errors caused by poor action selection.
Consider fine-tuning MLLMs with optimal-policy rollouts to enhance long-term observational memory.
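As a rough illustration of separating forgetting from other failures, the sketch below splits failed Matching Pairs turns into two buckets. This is not the paper's Memory Gap definition, which this summary does not give; it is an assumed proxy. The function name `classify_misses`, the `(first, second)` turn format, and the two bucket labels are hypothetical.

```python
def classify_misses(board, turns):
    """Split failed turns into 'forgetting' and 'exploration' misses.

    A miss counts as 'forgetting' when the partner of the first flipped card
    had already been revealed on an earlier turn, so a perfect-memory agent
    would have matched it. Any other miss counts as 'exploration'.
    """
    revealed = set()  # positions shown on any earlier turn
    counts = {"forgetting": 0, "exploration": 0}
    for first, second in turns:
        if board[first] != board[second]:
            partner_known = any(
                p != first and board[p] == board[first] for p in revealed
            )
            counts["forgetting" if partner_known else "exploration"] += 1
        revealed.update((first, second))
    return counts


# Hypothetical four-card episode: the second turn misses even though
# the partner of card 2 (position 0) was revealed on the first turn.
example_board = [0, 1, 0, 1]
example_turns = [(0, 1), (2, 3), (0, 2), (1, 3)]
result = classify_misses(example_board, example_turns)
```

A high share of 'forgetting' misses would point to memory as the bottleneck rather than action selection, which is the kind of distinction the Memory Gap metric is described as drawing.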

Topics

Multimodal Large Language Models
Non-Markov Games
RNG-Bench
Hidden State Reconstruction
MLLM Benchmarking
Model Fine-tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.