Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation

2026-05-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers from Tsinghua University and Xiaomi Group introduce Interleaved Vision–Language Reasoning (IVLR), a policy framework for long-horizon robotic manipulation. IVLR is built around an explicit intermediate representation, the IVLR-Trace, which alternates textual subgoals with visual keyframes across the entire task. Existing Vision-Language-Action (VLA) policies often keep planning implicit or reason in a single modality. IVLR instead generates a complete semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, the original instruction, and the current observation. To train the system, the authors build a pseudo-supervision pipeline that constructs traces by segmenting demonstrations and captioning each stage with a vision-language model. IVLR achieves 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX, results the authors present as evidence of gains on long-horizon tasks and robustness to visual distribution shift.

Key takeaway

For research scientists developing robot manipulation policies, IVLR offers a concrete approach to long-horizon tasks. Consider adding explicit interleaved vision-language traces to improve causal coherence and geometric grounding, especially for multi-stage operations. Generating the trace up front adds latency, but the reported robustness to execution perturbations and visual shifts, together with higher success rates on long-horizon benchmarks, makes this a reasonable trade-off in static environments. Because the trace is generated once from the initial observation and then cached, it does not adapt when the scene changes; dynamic replanning remains future work.

Key insights

Interleaving textual subgoals with visual keyframes provides robust semantic-geometric context for long-horizon robot manipulation.

Principles

Explicit multimodal traces enhance causal coherence and geometric grounding.
Both text and visual modalities are necessary for robust long-horizon planning.
Pseudo-supervision can generate training data for complex multimodal representations.

Method

A single native multimodal transformer generates a full-horizon IVLR-Trace (textual subgoals interleaved with visual keyframes) from the initial observation and instruction. The trace is then cached and used to condition a closed-loop action decoder.
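
The generate-once, cache, then decode-in-a-loop pattern described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `TraceStep`, `run_episode`, and the four callables (`reset`, `step`, `generate_trace`, `decode_action`) are hypothetical names introduced here, and the real model interfaces are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class TraceStep:
    """One stage of a trace: a textual subgoal paired with a visual keyframe."""
    subgoal: str   # e.g. "pick up the bowl"
    keyframe: Any  # image for this stage (array, tensor, file path, ...)


Trace = List[TraceStep]


def run_episode(
    instruction: str,
    reset: Callable[[], Any],                        # hypothetical: returns the initial observation
    step: Callable[[Any], Tuple[Any, bool]],         # hypothetical: action -> (observation, done)
    generate_trace: Callable[[Any, str], Trace],     # hypothetical: trace generator
    decode_action: Callable[[Trace, str, Any], Any], # hypothetical: action decoder
    max_steps: int = 200,
) -> bool:
    """Roll out one episode; return True if the environment reports completion."""
    obs = reset()
    # The full-horizon trace is produced once from the initial observation
    # and instruction, then reused unchanged for the rest of the episode.
    trace = generate_trace(obs, instruction)
    for _ in range(max_steps):
        # Closed loop: every action is conditioned on the cached trace,
        # the original instruction, and the current observation.
        action = decode_action(trace, instruction, obs)
        obs, done = step(action)
        if done:
            return True
    return False
```

Keeping `generate_trace` outside the loop is what makes the trace a one-time latency cost. It is also why, in this sketch, the plan cannot react to a scene that changes mid-episode.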

In practice

Use UVD (Universal Visual Decomposer) for temporal segmentation of robot demonstrations.
Employ a VLM (e.g., Qwen3-VL) to caption segmented stages.
Integrate generated traces as cached context for action decoding.
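
The segment-then-caption steps above can be sketched as a small pipeline. Everything here is an assumption made for illustration: `segment_by_change` is a simple feature-jump heuristic standing in for UVD (it is not UVD's algorithm or API), `caption_stage` is a hypothetical callable that would wrap a VLM such as Qwen3-VL, and taking the last frame of each stage as its keyframe is a choice made for this sketch, not one confirmed by the source.

```python
from typing import Any, Callable, List, Sequence, Tuple


def segment_by_change(
    features: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Stand-in segmenter (NOT UVD): start a new stage wherever consecutive
    per-frame feature vectors differ by more than `threshold` (Euclidean).
    Returns inclusive (start, end) frame-index pairs."""
    if not features:
        return []
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(features)):
        dist = sum((a - b) ** 2 for a, b in zip(features[i], features[i - 1])) ** 0.5
        if dist > threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(features) - 1))
    return segments


def build_pseudo_trace(
    frames: Sequence[Any],
    features: Sequence[Sequence[float]],
    caption_stage: Callable[[Sequence[Any]], str],  # hypothetical VLM wrapper
    threshold: float,
) -> List[Tuple[str, Any]]:
    """Turn one demonstration into interleaved (subgoal text, keyframe) pairs."""
    trace: List[Tuple[str, Any]] = []
    for start, end in segment_by_change(features, threshold):
        subgoal = caption_stage(frames[start : end + 1])  # caption the whole stage
        keyframe = frames[end]  # assumption: the stage's final frame is its keyframe
        trace.append((subgoal, keyframe))
    return trace
```

In a real pipeline the segmenter and captioner would be replaced by the actual UVD and VLM calls. The output shape, one text subgoal and one keyframe per stage, is the part that matches the trace format described in this summary.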

Topics

Interleaved Vision-Language Reasoning
IVLR-Trace
Long-Horizon Robot Manipulation
Multimodal Transformers
Pseudo-Trace Construction

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.