KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
Summary
KITE is a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). Released on April 8, 2026, KITE distills each robot trajectory into a small set of motion-salient keyframes annotated with open-vocabulary detections, and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized together with robot-profile and scene-context tokens into a single prompt, so that an off-the-shelf VLM can perform failure detection, identification, localization, explanation, and correction. On the RoboFAC benchmark, KITE paired with Qwen2.5-VL significantly outperforms vanilla Qwen2.5-VL in the training-free setting, particularly on simulation failure detection, identification, and localization, while remaining competitive with RoboFAC-tuned baselines. A small QLoRA fine-tune further improves explanation and correction quality, and qualitative results on real dual-arm robots illustrate KITE's practical applicability.
Key takeaway
For research scientists building robust robotic systems, KITE offers a structured, interpretable front-end for VLM-based robot failure analysis. Consider KITE when you need to condense long robot-execution videos into compact, tokenized evidence that an off-the-shelf VLM can use for failure detection, identification, localization, explanation, and correction, especially when no task-specific training is possible. The reported results cover both simulated benchmarks and qualitative tests on real dual-arm robots.
Key insights
KITE transforms robot videos into tokenized, keyframe-indexed evidence for VLMs, improving failure analysis without requiring any training of the VLM.
Principles
- Distill long videos into motion-salient keyframes.
- Encode object layout and context in BEV representations.
- Unify visual and contextual tokens into a single VLM prompt.
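The first principle can be illustrated with a minimal sketch. The summary does not say how KITE scores motion salience, so the frame-differencing score, the greedy selection rule, and the names `motion_scores`, `select_keyframes`, `k`, and `min_gap` below are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence

# Hypothetical representation: each frame is a flattened grayscale image.
Frame = Sequence[float]


def motion_scores(frames: List[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    Assumption: frame differencing stands in for whatever motion-salience
    measure KITE actually uses.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / max(len(cur), 1))
    return scores


def select_keyframes(frames: List[Frame], k: int = 4, min_gap: int = 2) -> List[int]:
    """Greedily pick up to k highest-motion frame indices, at least min_gap apart."""
    scores = motion_scores(frames)
    # Rank indices from highest to lowest motion score.
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in ranked:
        if len(chosen) == k:
            break
        # Skip frames too close in time to an already chosen keyframe.
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return in temporal order


if __name__ == "__main__":
    # Toy video: the scene changes at frame 2 and again at frame 5.
    video = [[0, 0], [0, 0], [10, 10], [10, 10], [10, 10], [0, 0], [0, 0]]
    print(select_keyframes(video, k=2, min_gap=2))
```

The minimum-gap constraint is one simple way to keep the selected keyframes spread across the trajectory instead of clustering around a single burst of motion.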
Method
KITE selects motion-salient keyframes, generates bird's-eye-view representations with object layouts and metadata, and serializes these with robot-profile and scene-context tokens into a unified prompt for VLM processing.
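The serialization step described above might look roughly like the following sketch. The token format (`[ROBOT]`, `[SCENE]`, `[KF ...]`, `[TASK]`), the data classes, and the function names are hypothetical, and the coordinate unit is an assumption; the summary only states that layout, timestamps, and detection confidence are encoded alongside robot-profile and scene-context tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str         # open-vocabulary object label
    x: float           # BEV position along x (metres is an assumed unit)
    y: float           # BEV position along y
    confidence: float  # detector confidence in [0, 1]


@dataclass
class Keyframe:
    index: int          # frame index within the original video
    timestamp_s: float  # timestamp in seconds
    detections: List[Detection]


def bev_tokens(kf: Keyframe) -> str:
    """Serialize one keyframe's layout into a compact text line (hypothetical format)."""
    header = f"[KF {kf.index} t={kf.timestamp_s:.1f}s]"
    objs = [
        f"{d.label}@({d.x:+.2f},{d.y:+.2f}) conf={d.confidence:.2f}"
        for d in kf.detections
    ]
    return header if not objs else header + " " + "; ".join(objs)


def build_prompt(robot_profile: str, scene_context: str,
                 keyframes: List[Keyframe], question: str) -> str:
    """Join robot, scene, keyframe, and task tokens into a single prompt string."""
    lines = [f"[ROBOT] {robot_profile}", f"[SCENE] {scene_context}"]
    lines += [bev_tokens(kf) for kf in keyframes]
    lines.append(f"[TASK] {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    kf = Keyframe(2, 1.5, [Detection("cup", 0.10, -0.25, 0.90)])
    print(build_prompt("dual-arm manipulator", "tabletop with a cup",
                       [kf], "Did the task fail? If so, where and why?"))
```

In a real pipeline the resulting text would be passed to the VLM together with the keyframe and BEV images; this sketch covers only the textual part of the prompt.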
In practice
- Use KITE for training-free robot failure analysis.
- Apply a small QLoRA fine-tune to improve explanation and correction quality.
- Integrate KITE with off-the-shelf VLMs like Qwen2.5-VL.
Topics
- KITE Framework
- Robot Failure Analysis
- Vision-Language Models
- Keyframe Extraction
- Bird's-Eye-View Representation
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.