FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

FiberTune is a training-time objective for vision-language-action (VLA) policies. It targets a failure mode of action-supervised fine-tuning: the loss of visual structure that is shared across action-equivalent states, which is formalized as residual visual collapse along local action fibers. FiberTune uses an online action probe to estimate action-predictive feature directions, filters those directions out of intermediate visual-token representations, and aligns the remaining probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Because the objective applies only during training, it adds no inference-time overhead. The method yields consistent improvements, including +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and an increase in physical SO-101 pick-place success from 72.7% to 78.1%, across the pi_0.5 and OpenVLA-OFT architectures.

Key takeaway

Machine learning engineers fine-tuning vision-language-action policies should consider integrating FiberTune to prevent visual structure collapse. The reported gains are +10.7 percentage points SR(5) on CALVIN ABC-to-D and an increase from 72.7% to 78.1% on physical SO-101 pick-place tasks. Because the objective is applied only at training time, it adds no inference overhead.

Key insights

FiberTune prevents visual collapse in VLA fine-tuning by preserving teacher-structured visual residuals, improving policy performance without inference overhead.

Principles

Method

FiberTune uses an online action probe to estimate action-predictive feature directions and filters them out of the intermediate visual-token representations. It then aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank.
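The three steps in this description (probe, filter, align with a rank regularizer) can be illustrated with a small NumPy sketch. It is not the authors' implementation, and everything beyond the summary is an assumption made for illustration: the probe is a closed-form ridge regression fitted on one batch rather than an online probe, each row of the feature matrix is one visual token paired with its sample's action, student and teacher features share a dimension, alignment is a cosine distance, effective rank is the exponential of the entropy of the normalized singular values, and the regularizer rewards higher rank. The function names and the rank_weight parameter are hypothetical. The code computes the loss value only; real training would use an autodiff framework.

```python
import numpy as np


def action_directions(feats, actions, ridge=1e-3):
    """Fit a ridge probe feats -> actions (assumed stand-in for the online
    probe) and return an orthonormal basis (D, A) for what it reads."""
    X = feats - feats.mean(axis=0, keepdims=True)
    Y = actions - actions.mean(axis=0, keepdims=True)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, X.T @ Y)  # probe weights, shape (D, A)
    Q, _ = np.linalg.qr(W)              # orthonormalize the probe columns
    return Q


def effective_rank(Z, eps=1e-12):
    """exp(entropy) of the normalized singular values of centered Z
    (one common definition; assumed here)."""
    s = np.linalg.svd(Z - Z.mean(axis=0, keepdims=True), compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))


def _unit_rows(M, eps=1e-8):
    # Scale each row to unit length so a dot product is a cosine similarity.
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)


def fibertune_style_loss(student, teacher, actions, rank_weight=0.1):
    """Hypothetical loss: cosine alignment of probe-filtered student
    residuals to teacher features, minus a log effective-rank bonus."""
    Q = action_directions(student, actions)
    centered = student - student.mean(axis=0, keepdims=True)
    # Project out the action-predictive subspace spanned by Q.
    residual = centered - centered @ Q @ Q.T
    target = teacher - teacher.mean(axis=0, keepdims=True)
    cosine = (_unit_rows(residual) * _unit_rows(target)).sum(axis=1)
    align = 1.0 - cosine.mean()
    loss = align - rank_weight * np.log(effective_rank(residual))
    return float(loss), residual, Q
```

By construction the filtered residual has no component along the probe's directions, so the fitted probe predicts nothing from it. That is the sense in which the alignment term here acts only on what the action probe does not use.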

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.