ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

2026-03-10 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

ObjChangeVR is a framework, paired with the ObjChangeVR-Dataset benchmark, for reasoning about object state changes from continuous egocentric views in virtual reality (VR) environments. Developed by researchers from Kennesaw State University and Pennsylvania State University, the dataset benchmarks natural-language question answering about object state changes, particularly those that occur in the background without direct user interaction or explicit motion cues. The framework combines viewpoint-aware and temporal retrieval, which selects relevant frames from long egocentric video sequences, with a cross-view reasoning module that reconciles inconsistent evidence from multiple viewpoints. In experiments with multimodal large language models (MLLMs) such as GPT-4o, GPT-4o mini, and Gemini 2.0 Flash, ObjChangeVR significantly outperforms baseline approaches, reaching an overall average EM@0.8 of 0.754 with GPT-4o, with larger gains for the smaller MLLMs.

Key takeaway

For AI scientists and research scientists building MLLM-based VR applications, ObjChangeVR offers a framework for detecting subtle object state changes. Consider integrating viewpoint metadata and a multi-stage reasoning pipeline to improve accuracy, especially for background changes that lack direct interaction cues. Limiting retrieval to a small number of frames, such as $k=3$, can balance contextual richness against inference latency and token consumption, which makes VR scene understanding both more reliable and more efficient.

Key insights

ObjChangeVR enhances MLLM performance in VR object change detection by integrating viewpoint-aware retrieval and cross-view temporal reasoning.

Principles

Viewpoint metadata improves relevant frame retrieval in VR.
Cross-view reasoning reconciles inconsistent observations.
Temporal progression cues aid object state change detection.

Method

ObjChangeVR retrieves frames through three-stage hierarchical filtering (by position, then orientation, then time), and then applies two-stage chain-of-thought prompting, covering cross-view and temporal reasoning, to derive the final answer.
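The filtering order described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Frame` fields, the distance and angle thresholds, the restriction to frames recorded before the query, and the even-spacing rule used for the temporal stage are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    """One egocentric frame with hypothetical pose metadata."""
    index: int        # position in the recording (time order)
    position: tuple   # camera (x, y, z) in scene units
    forward: tuple    # non-zero vector along the camera's viewing direction


def angle_deg(u, v):
    """Angle between two 3-D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def retrieve_frames(frames, query, k=3, max_dist=1.5, max_angle=30.0):
    """Filter by position, then orientation, then pick k frames spread over time.

    max_dist and max_angle are illustrative thresholds, not values from the paper.
    """
    # Stage 1 (position): keep earlier frames captured near the query viewpoint.
    near = [
        f for f in frames
        if f.index < query.index
        and math.dist(f.position, query.position) <= max_dist
    ]
    # Stage 2 (orientation): keep frames looking in a similar direction.
    aligned = [f for f in near if angle_deg(f.forward, query.forward) <= max_angle]
    # Stage 3 (temporal): sample k survivors evenly across the timeline.
    aligned.sort(key=lambda f: f.index)
    if len(aligned) <= k:
        return aligned
    if k == 1:
        return [aligned[-1]]  # fall back to the most recent matching frame
    step = (len(aligned) - 1) / (k - 1)
    return [aligned[round(i * step)] for i in range(k)]
```

Applying the cheap geometric tests first means the temporal stage only has to choose among frames that already see the same part of the scene from a comparable pose.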

In practice

Utilize 6-DoF camera pose data for enhanced scene understanding.
Implement multi-frame reasoning to overcome occlusions and varied viewpoints.
Retrieve $k=3$ frames to balance contextual richness against inference latency and token consumption.
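One way to wire multi-frame reasoning into a two-stage chain-of-thought prompt is sketched below. The prompt wording, the function name `two_stage_answer`, and the `call_mllm(prompt, frame_refs)` callable are hypothetical placeholders, not the prompts or API used by ObjChangeVR; the sketch also assumes that cross-view reasoning runs first and temporal reasoning second.

```python
def two_stage_answer(question, frame_refs, call_mllm):
    """Run cross-view reasoning first, then temporal reasoning on its output.

    call_mllm(prompt, frame_refs) -> str is a hypothetical callable wrapping
    whichever MLLM client is in use; it is not a real library API.
    """
    views = ", ".join(f"view {i}" for i in range(1, len(frame_refs) + 1))

    # Stage 1 (cross-view): describe each view and reconcile disagreements.
    cross_view_prompt = (
        f"Question: {question}\n"
        f"You are given {len(frame_refs)} frames ({views}), oldest first.\n"
        "Describe the target object's state in each view, note occlusions, "
        "and resolve any disagreement between views."
    )
    reconciled = call_mllm(cross_view_prompt, frame_refs)

    # Stage 2 (temporal): reason over the reconciled observations in time order.
    temporal_prompt = (
        f"Question: {question}\n"
        f"Reconciled per-view observations:\n{reconciled}\n"
        "Reason about how the object's state progresses over time, "
        "then give the final answer."
    )
    return call_mllm(temporal_prompt, frame_refs)
```

Because each call sends every retrieved frame, keeping `frame_refs` short (for example, three frames) directly limits the tokens consumed per question.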

Topics

Multimodal Large Language Models
Virtual Reality
Object State Change Detection
Egocentric Vision
Visual Question Answering

Code references

sding11/ObjChangeVR

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.