InSight: Self-Guided Skill Acquisition via Steerable VLAs

2026-06-23 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

InSight is a framework for autonomous skill acquisition in Vision-Language-Action (VLA) models, built on steerability at the level of primitive actions. It targets a core limitation of VLA models: their capabilities are bounded by their training data. InSight operates in two stages. First, an automated segmentation pipeline labels the primitive actions in demonstrations using vision-language model (VLM) plan decomposition and end-effector poses. Second, a VLM-guided data flywheel identifies the primitives missing for a new task, autonomously generates demonstrations with VLM-proposed low-level control, and adds the successful attempts to the VLA training set. Evaluated in simulation and on real-world manipulation tasks including block flipping, drawer closing, sweeping, twisting, and pouring, InSight learns these skills without human demonstrations. The learned primitives can then be composed to execute novel, long-horizon tasks.

Key takeaway

For robotics engineers developing autonomous manipulation systems, InSight offers a way past the limits of fixed training data. Consider combining primitive-level steerability with a VLM-guided data flywheel so that VLA policies can acquire and compose new skills on their own. The approach can reduce reliance on human demonstrations when expanding task capabilities and tackling novel, long-horizon tasks.

Key insights

InSight enables autonomous, continuous skill acquisition for VLA models by making primitive actions steerable and by generating its own training data.

Principles

Primitive steerability enhances VLA adaptability.
Self-guided data generation expands skill sets.
Composable primitives enable long-horizon tasks.

Method

InSight segments demonstrations into labeled primitives via VLM plan decomposition and end-effector poses, then uses a VLM-guided flywheel to identify the primitives missing for a novel task, demonstrate them autonomously, label them, and integrate the successful attempts into the training set.
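
The flywheel stage can be pictured as a small loop. What follows is a minimal sketch under stated assumptions, not InSight's implementation: the class name `Flywheel`, the `attempt` callback (standing in for a VLM-proposed low-level rollout plus a success check), and the retry budget `max_tries` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical callback: (primitive name, attempt index) -> (trajectory, success flag).
Attempt = Callable[[str, int], Tuple[list, bool]]


@dataclass
class Flywheel:
    known: Set[str]                                # primitives the policy already has data for
    dataset: List[dict] = field(default_factory=list)

    def missing(self, plan: List[str]) -> List[str]:
        """Return plan primitives with no training data, deduplicated, in plan order."""
        seen, out = set(), []
        for prim in plan:
            if prim not in self.known and prim not in seen:
                seen.add(prim)
                out.append(prim)
        return out

    def acquire(self, plan: List[str], attempt: Attempt, max_tries: int = 5) -> List[str]:
        """Try to demonstrate each missing primitive; keep only successful attempts."""
        for prim in self.missing(plan):
            for i in range(max_tries):
                trajectory, ok = attempt(prim, i)
                if ok:
                    # A successful rollout becomes a labeled training example.
                    self.dataset.append({"primitive": prim, "trajectory": trajectory})
                    self.known.add(prim)
                    break
        return self.missing(plan)                  # primitives still not acquired


# Stub rollout (assumption): "twist" succeeds on its second try, "pour" never does.
def fake_attempt(prim: str, i: int) -> Tuple[list, bool]:
    return [0.0, 1.0], (prim == "twist" and i >= 1)


wheel = Flywheel(known={"reach", "grasp"})
still_missing = wheel.acquire(["reach", "grasp", "twist", "pour", "twist"], fake_attempt)
```

In a real system the stub would be replaced by a rollout in simulation or on hardware, and the enlarged dataset would be used to fine-tune the VLA before the next task is attempted.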

In practice

Apply VLM plan decomposition for primitive labeling.
Implement a data flywheel for autonomous skill growth.
Compose learned primitives for complex robotic tasks.
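
For the first item above, one simple way to attach plan-step labels to a demonstration is to cut the trajectory wherever the gripper state changes and pair the resulting spans with the VLM's plan steps in order. This is an illustrative sketch only: the source says the pipeline uses end-effector poses but does not specify the segmentation rule, so the gripper-transition heuristic, the function names, and the one-step-per-span pairing are all assumptions.

```python
from typing import List, Tuple


def segment_by_gripper(gripper_open: List[bool]) -> List[Tuple[int, int]]:
    """Split timesteps into [start, end) spans at every gripper open/close change."""
    if not gripper_open:
        return []
    spans, start = [], 0
    for t in range(1, len(gripper_open)):
        if gripper_open[t] != gripper_open[t - 1]:
            spans.append((start, t))
            start = t
    spans.append((start, len(gripper_open)))
    return spans


def label_segments(
    spans: List[Tuple[int, int]], plan_steps: List[str]
) -> List[Tuple[str, int, int]]:
    """Pair each span with one plan step (assumes the VLM plan has one step per span)."""
    if len(spans) != len(plan_steps):
        raise ValueError("plan steps and trajectory segments do not line up")
    return [(step, start, end) for step, (start, end) in zip(plan_steps, spans)]


# Toy demonstration: gripper open for 2 steps, closed for 3, open for 1.
gripper = [True, True, False, False, False, True]
spans = segment_by_gripper(gripper)
labeled = label_segments(spans, ["reach", "carry", "release"])
```

Raising an error on a length mismatch is a deliberate choice in this sketch: a demonstration whose segments do not match the plan is better rejected than mislabeled.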

Topics

Vision-Language-Action Models
Robotic Manipulation
Skill Acquisition
Autonomous Learning
Primitive Actions
Data Flywheel

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.