Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Summary
Hierarchical Advantage-Weighted Behavior Cloning (HABC) is a method for online reinforcement learning fine-tuning of Vision-Language-Action (VLA) policies when the only feedback is a sparse, binary episode outcome. Existing approaches often reduce success or failure to a single scalar reward, which conflates two objectives, viability (can the task still succeed?) and efficiency (how quickly does it succeed?), and gives little guidance once basic success is achieved. These methods also assign credit incorrectly when real-world rollouts mix autonomous segments with human intervention segments. HABC trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages with a state-adaptive gate g_t. The gate prioritizes viability when viability is uncertain and shifts toward efficiency when viability is high, producing a per-transition weight for the actor loss. Intervention-aware credit assignment ensures that supervision applies only to segments executed by the current policy. In real-robot experiments on three contact-rich bimanual tasks, HABC raised success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
Key takeaway
For robotics engineers fine-tuning Vision-Language-Action policies with online reinforcement learning, Hierarchical Advantage-Weighted Behavior Cloning (HABC) is worth considering. It targets two problems that often limit performance: sparse binary rewards and rollouts that mix autonomous and intervention segments. In the reported real-robot experiments on three contact-rich bimanual tasks, combining hierarchical objective balancing with intervention-aware credit assignment raised success rates over supervised fine-tuning from 36% to 92%, 44% to 88%, and 12% to 38%.
Key insights
HABC improves VLA fine-tuning by hierarchically balancing viability and efficiency objectives with intervention-aware credit assignment.
Principles
- Conflating viability and efficiency in RL feedback limits guidance.
- Sparse binary outcomes provide limited gradient for efficient task completion.
- Intervention-aware credit assignment prevents supervision leakage.
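The third principle can be illustrated with a small masking sketch. The summary only states that supervision should apply to segments executed by the current policy; the function names, the boolean `is_policy_step` flag, and the normalization by the number of policy steps are all assumptions made here for illustration, not the authors' implementation.

```python
def masked_weights(weights, is_policy_step):
    """Zero out per-transition weights on steps not executed by the current policy.

    `is_policy_step` is a hypothetical per-step flag: True for autonomous
    steps, False for steps taken during a human intervention.
    """
    if len(weights) != len(is_policy_step):
        raise ValueError("weights and is_policy_step must have the same length")
    return [w if p else 0.0 for w, p in zip(weights, is_policy_step)]


def weighted_bc_loss(per_step_nll, weights, is_policy_step):
    """Weighted behavior-cloning loss averaged over policy-executed steps only.

    `per_step_nll` holds the negative log-likelihood of each logged action
    under the policy. Normalizing by the count of policy steps is an
    assumption of this sketch.
    """
    masked = masked_weights(weights, is_policy_step)
    n_policy = sum(1 for p in is_policy_step if p)
    if n_policy == 0:
        return 0.0  # no autonomous steps in this rollout, nothing to supervise
    return sum(w * l for w, l in zip(masked, per_step_nll)) / n_policy
```

With this masking, an intervention step contributes nothing to the weighted loss regardless of how the episode ended, which is one simple way to keep outcome-based credit from leaking onto actions the policy did not choose.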
Method
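HABC trains separate critic heads for viability and efficiency on distinct data subsets, then merges their one-step advantages with a state-adaptive gate g_t to produce a per-transition weight for the actor loss. The gate favors viability when viability is uncertain and shifts toward efficiency when viability is high. Intervention-aware credit assignment restricts supervision to the segments executed by the current policy.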
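A minimal sketch of how a gate of this kind could merge two advantages into one actor-loss weight is shown below. The summary does not give the functional form of g_t or of the weight, so the sigmoid gate on the viability value, the convex combination, the exponential weighting, and every parameter (`threshold`, `sharpness`, `temperature`, `max_weight`) are assumptions chosen for illustration.

```python
import math


def gate(viability_value, threshold=0.8, sharpness=10.0):
    """Hypothetical gate g_t in (0, 1).

    Close to 0 when the estimated viability is low (prioritize viability),
    close to 1 when it is high (prioritize efficiency).
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability_value - threshold)))


def combined_advantage(adv_viability, adv_efficiency, viability_value):
    """Blend the two one-step advantages with the gate (assumed convex mix)."""
    g = gate(viability_value)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def actor_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponential advantage weight, clipped for stability (assumed form)."""
    return min(math.exp(advantage / temperature), max_weight)


# Low viability: the viability advantage dominates the blend.
low = combined_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability_value=0.0)
# High viability: the efficiency advantage dominates the blend.
high = combined_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability_value=1.0)
```

In this sketch `low` is positive and `high` is negative, so the same pair of advantages up-weights a transition when success is in doubt and down-weights it once success is likely but the action is inefficient.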
In practice
- Apply HABC to fine-tune VLAs on contact-rich bimanual tasks.
- Use separate critics for distinct objectives like viability and efficiency.
- Implement state-adaptive weighting for objective prioritization.
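As one way to picture "separate critics on distinct data subsets", the sketch below builds regression targets for two critic heads from binary episode outcomes. The summary does not specify which subset each head uses or what the targets are; here it is assumed that the viability head sees every episode with a success label as its target, and that the efficiency head sees only successful episodes with a discounted time-to-completion target. The episode format (`length`, `success`) and `gamma` are likewise hypothetical.

```python
def build_critic_targets(episodes, gamma=0.99):
    """Build (episode_id, step, target) tuples for two critic heads.

    Assumptions of this sketch, not taken from the source:
    - viability head: all episodes, target 1.0 on success and 0.0 on failure
    - efficiency head: successful episodes only, target gamma**(steps remaining)
    """
    viability, efficiency = [], []
    for ep_id, ep in enumerate(episodes):
        length = ep["length"]
        for t in range(length):
            viability.append((ep_id, t, 1.0 if ep["success"] else 0.0))
            if ep["success"]:
                # Fewer remaining steps gives a target closer to 1.0.
                efficiency.append((ep_id, t, gamma ** (length - 1 - t)))
    return viability, efficiency


episodes = [
    {"length": 2, "success": True},
    {"length": 3, "success": False},
]
viability_targets, efficiency_targets = build_critic_targets(episodes)
```

Keeping failed episodes out of the efficiency targets means that head only ranks how fast successful behavior finishes, while the viability head alone carries the success/failure signal.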
Topics
- Hierarchical Advantage-Weighted Behavior Cloning
- Vision-Language-Action Policies
- Online Reinforcement Learning
- Robot Fine-Tuning
- Credit Assignment
- Bimanual Tasks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.