FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
Summary
FiberTune is a training-time objective for vision-language-action (VLA) policies that targets a failure mode of action-supervised fine-tuning. The authors formalize this failure as residual visual collapse along local action fibers: the loss of visual structure that is consistent across action-equivalent states. FiberTune uses an online action probe to estimate action-predictive feature directions and filters those directions from intermediate visual-token representations. It then aligns the probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Because the objective applies only during training, it adds no inference-time overhead. Across the pi_0.5 and OpenVLA-OFT architectures, the method consistently improved performance, including +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and a rise in physical SO-101 pick-place task success from 72.7% to 78.1%.
Key takeaway
Machine learning engineers fine-tuning vision-language-action policies should consider integrating FiberTune to prevent visual structure collapse. The reported gains are +10.7 percentage points SR(5) on CALVIN ABC-to-D and a rise from 72.7% to 78.1% on physical SO-101 pick-place tasks. Because the objective applies only at training time, it adds no inference overhead.
Key insights
FiberTune prevents visual collapse in VLA fine-tuning by preserving teacher-structured visual residuals, improving policy performance without inference overhead.
Principles
- Action-supervised fine-tuning risks visual structure collapse.
- Preserve teacher-structured visual residuals for VLA policies.
- Filter action-predictive features to isolate visual residuals.
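The filtering principle above can be illustrated with a minimal NumPy sketch. It makes several assumptions that the summary does not confirm: the probe is linear, it is fit once by ridge regression rather than updated online, and "filtering" means orthogonally projecting features onto the complement of the probe's weight span. The function names `fit_linear_probe` and `filter_action_directions` are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, actions, ridge=1e-3):
    """Ridge-regress actions on features (assumed linear probe).

    features: (n, d) array, actions: (n, a) array.
    Returns probe weights W of shape (d, a).
    """
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ actions)

def filter_action_directions(features, probe_weights):
    """Remove the component of each feature that lies in span(probe_weights)."""
    # Orthonormal basis for the action-predictive subspace, shape (d, a).
    q, _ = np.linalg.qr(probe_weights)
    # Subtract the projection onto that subspace; what remains is the residual.
    return features - (features @ q) @ q.T

# Synthetic data: 256 feature vectors of width 16, 3-dimensional actions.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 16))
true_w = rng.normal(size=(16, 3))
actions = features @ true_w

w = fit_linear_probe(features, actions)
residuals = filter_action_directions(features, w)
```

In this sketch the residuals carry no signal that the linear probe can read, since they are orthogonal to every probe weight column. A nonlinear or online probe, as the method describes, would need a different filtering step.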
Method
FiberTune uses an online action probe to estimate action-predictive feature directions and filters them from intermediate visual-token representations. It then aligns the probe-filtered residuals to a frozen visual teacher while regularizing their effective rank.
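The alignment and rank terms can be sketched as follows, as an illustration only and not the paper's actual loss. The assumptions: residuals and teacher features share the same shape, alignment is measured by cosine distance, effective rank is the exponential of the Shannon entropy of the normalized singular values (the Roy and Vetterli definition), and the regularizer rewards higher effective rank. The names `cosine_alignment_loss`, `sketch_loss`, and the weight `rank_weight` are hypothetical.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

def cosine_alignment_loss(residuals, teacher, eps=1e-8):
    """Mean cosine distance between matching rows (assumed alignment metric)."""
    r = residuals / (np.linalg.norm(residuals, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(r * t, axis=1)))

def sketch_loss(residuals, teacher, rank_weight=0.1):
    """Hypothetical combined objective: align to teacher, discourage rank collapse."""
    align = cosine_alignment_loss(residuals, teacher)
    # Negative log effective rank: minimizing it pushes effective rank up.
    rank_term = -np.log(effective_rank(residuals))
    return align + rank_weight * rank_term

# Example call on random stand-ins for residuals and frozen-teacher features.
rng = np.random.default_rng(0)
residuals = rng.normal(size=(64, 16))
teacher = rng.normal(size=(64, 16))
loss_value = sketch_loss(residuals, teacher)
```

In an actual training loop these terms would be written in an autodiff framework and the teacher features would be computed without gradients. NumPy is used here only to make the quantities concrete.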
In practice
- Applies to VLA policies such as pi_0.5 and OpenVLA-OFT.
- Improves long-horizon task success, as on CALVIN ABC-to-D.
- Improves physical robot manipulation success, as on SO-101 pick-place.
Topics
- Vision-Language-Action Policies
- Robotic Manipulation
- Fine-Tuning
- FiberTune
- Visual Residuals
- Policy Learning
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.