Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation
Summary
Researchers from Tsinghua University and Xiaomi Group introduce Interleaved Vision-Language Reasoning (IVLR), a policy framework for long-horizon robotic manipulation. IVLR relies on an explicit intermediate representation, the IVLR-Trace, which alternates textual subgoals with visual keyframes across the entire task. Existing Vision-Language-Action (VLA) policies often keep planning implicit or reason in a single modality. IVLR instead generates a complete semantic-geometric trace from the initial observation and instruction, caches it, and then conditions a closed-loop action decoder on the trace, the original instruction, and the current observation. To train the system, the authors built a pseudo-supervision pipeline that constructs traces by segmenting demonstrations and captioning each stage with a vision-language model. IVLR achieves 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX, demonstrating improvements on long-horizon tasks and robustness to visual distribution shifts.
Key takeaway
For research scientists developing robot manipulation policies, IVLR offers a compelling approach to long-horizon tasks. Consider implementing explicit interleaved vision-language traces to improve causal coherence and geometric grounding, especially for complex, multi-stage operations. Generating the trace up front adds latency, but the improved robustness to execution perturbations and visual shifts, together with higher success rates on long-horizon benchmarks, suggests a worthwhile trade-off in static environments. Because the trace is generated once from the initial observation and then cached, it cannot adapt when the scene changes; future work should explore dynamic replanning to overcome this limitation.
Key insights
Interleaving textual subgoals with visual keyframes provides robust semantic-geometric context for long-horizon robot manipulation.
Principles
- Explicit multimodal traces enhance causal coherence and geometric grounding.
- Both text and visual modalities are necessary for robust long-horizon planning.
- Pseudo-supervision can generate training data for complex multimodal representations.
Method
A single native multimodal transformer generates a full-horizon IVLR-Trace (textual subgoals interleaved with visual keyframes) from the initial observation and instruction. The trace is then cached and used to condition a closed-loop action decoder.
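The generate-once, cache, then decode-in-a-loop pattern described here can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: `TraceStep`, `run_episode`, and the callables `generate_trace`, `decode_action`, `env_reset`, and `env_step` are hypothetical names introduced for this example, and the environment interface (a step that returns an observation and a done flag) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class TraceStep:
    """One stage of a hypothetical interleaved trace."""
    subgoal: str   # textual subgoal for this stage
    keyframe: Any  # visual keyframe for this stage (e.g. an image array)


def run_episode(
    generate_trace: Callable[[Any, str], Sequence[TraceStep]],
    decode_action: Callable[[Sequence[TraceStep], str, Any], Any],
    env_reset: Callable[[], Any],
    env_step: Callable[[Any], Tuple[Any, bool]],
    instruction: str,
    max_steps: int = 100,
) -> List[Any]:
    """Plan once from the initial observation, then act in closed loop."""
    obs = env_reset()

    # The full-horizon trace is generated a single time and cached;
    # it is never regenerated during execution in this sketch.
    trace = tuple(generate_trace(obs, instruction))

    actions: List[Any] = []
    for _ in range(max_steps):
        # Each action is conditioned on the cached trace, the original
        # instruction, and the current observation.
        action = decode_action(trace, instruction, obs)
        obs, done = env_step(action)
        actions.append(action)
        if done:
            break
    return actions
```

Only the action decoder runs at every control step. The expensive trace generation is paid once per episode, which is where the up-front latency comes from and why the plan cannot react to scene changes.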
In practice
- Use UVD for temporal segmentation of robot demonstrations.
- Employ a VLM (e.g., Qwen3-VL) to caption segmented stages.
- Integrate generated traces as cached context for action decoding.
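The steps above can be sketched as a small pseudo-trace builder. This is a minimal sketch under stated assumptions, not the paper's pipeline: `segment_by_feature_change` is a simple threshold heuristic used here as a stand-in for UVD, `caption_stage` is a hypothetical callable standing in for a VLM captioner, and using the last frame of each stage as its keyframe is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class TraceStep:
    subgoal: str          # caption for one stage
    keyframe: np.ndarray  # representative frame for that stage


def segment_by_feature_change(
    features: np.ndarray, threshold: float
) -> List[Tuple[int, int]]:
    """Split frames into stages where consecutive features jump sharply.

    A simple stand-in for a real temporal segmenter such as UVD.
    Returns half-open (start, end) index ranges covering all frames.
    """
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    cuts = [i + 1 for i, d in enumerate(diffs) if d > threshold]
    bounds = [0] + cuts + [len(features)]
    return list(zip(bounds[:-1], bounds[1:]))


def build_pseudo_trace(
    frames: np.ndarray,
    features: np.ndarray,
    caption_stage: Callable[[np.ndarray], str],
    threshold: float = 1.0,
) -> List[TraceStep]:
    """Interleave one caption and one keyframe per segmented stage."""
    trace: List[TraceStep] = []
    for start, end in segment_by_feature_change(features, threshold):
        stage_frames = frames[start:end]
        trace.append(
            TraceStep(
                subgoal=caption_stage(stage_frames),  # hypothetical VLM call
                keyframe=stage_frames[-1],  # assumption: last frame of stage
            )
        )
    return trace
```

Passing the captioner in as a callable keeps the sketch independent of any particular VLM API; in practice it would wrap whichever model is used to describe each stage.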
Topics
- Interleaved Vision-Language Reasoning
- IVLR-Trace
- Long-Horizon Robot Manipulation
- Multimodal Transformers
- Pseudo-Trace Construction
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.