ReCal: Reward Calibration for RL-based LLM Routing
Summary
ReCal is a reward calibration framework for reinforcement learning (RL)-based large language model (LLM) routing. Existing methods struggle with ambiguous credit assignment from scalarized rewards and with optimization bias caused by heterogeneous task difficulty and reward variability. ReCal calibrates in two stages. First, a hierarchical reward decomposition mechanism with component-wise advantage estimation provides clearer learning signals. Second, a distribution-aware optimization strategy calibrates variability through variance-aware reweighting and per-dataset normalization. In experiments across seven datasets, ReCal consistently improves routing performance and training stability over baselines.
Key takeaway
For machine learning engineers optimizing LLM routing policies, ReCal targets two common training problems: instability and optimization bias. Its hierarchical reward decomposition gives clearer credit assignment, and its distribution-aware optimization gives more stable policy updates across diverse tasks and varying query difficulty. Consider implementing component-wise advantage estimation and variance-aware reweighting to improve your routing system's performance and reliability.
Key insights
ReCal calibrates RL-based LLM routing rewards by disentangling objectives and normalizing distributions for clearer, more stable policy learning.
Principles
- Decompose scalar rewards into objective-specific components.
- Calibrate optimization signals across heterogeneous data distributions.
- Prioritize uncertain routing cases with higher optimization weights.
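The first principle can be illustrated with a small sketch. It is not ReCal's implementation: the component names, the weights, and the use of plain within-group standardization are assumptions made for illustration. It compares an advantage computed from a summed scalar reward with one computed per component and then combined.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def componentwise_advantages(group, weights, eps=1e-6):
    """Standardize each reward component separately, then take a weighted sum.

    group:   list of dicts, one per rollout, mapping component name -> reward
    weights: dict mapping component name -> weight (illustrative values)
    """
    total = [0.0] * len(group)
    for name, weight in weights.items():
        adv = group_advantages([rollout[name] for rollout in group], eps)
        total = [t + weight * a for t, a in zip(total, adv)]
    return total


# Four hypothetical rollouts for one query.
group = [
    {"answer": 1.0, "format": 0.0},  # correct answer, bad format
    {"answer": 0.0, "format": 1.0},  # wrong answer, good format
    {"answer": 1.0, "format": 1.0},
    {"answer": 0.0, "format": 0.0},
]
weights = {"answer": 1.0, "format": 0.2}  # assumed, not from the paper

# Scalarized baseline: plain sum of components, then one advantage.
scalar_adv = group_advantages([sum(r.values()) for r in group])

# Decomposed version: one advantage per component, then combine.
comp_adv = componentwise_advantages(group, weights)
```

With the plain sum, rollouts 0 and 1 receive the same advantage even though only one of them answered correctly. The component-wise version keeps them apart, which is the kind of credit-assignment ambiguity the decomposition is meant to remove.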
Method
ReCal first applies hierarchical reward decomposition with component-wise advantage estimation. It then applies variance-aware reweighting, which upweights uncertain instances, and per-dataset normalization, which aligns optimization scales across datasets. Both stages are integrated into a GRPO-style (Group Relative Policy Optimization) objective.
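A minimal sketch of the distribution-aware stage follows. The summary does not give ReCal's formulas, so both functions are assumptions: `variance_weight` uses a hypothetical linear form `1 + alpha * std`, and `normalize_per_dataset` simply divides each advantage by the standard deviation of its dataset.

```python
from collections import defaultdict
from statistics import pstdev


def variance_weight(group_rewards, alpha=1.0):
    """Hypothetical weight that grows with the reward spread in a rollout group.

    A group whose rollouts all score the same gets weight 1.0; a group with
    mixed outcomes (an uncertain routing case) gets a larger weight.
    """
    return 1.0 + alpha * pstdev(group_rewards)


def normalize_per_dataset(samples, eps=1e-6):
    """Rescale advantages so every dataset has a comparable spread.

    samples: list of (dataset_name, advantage) pairs
    """
    by_dataset = defaultdict(list)
    for name, adv in samples:
        by_dataset[name].append(adv)
    scale = {name: pstdev(vals) + eps for name, vals in by_dataset.items()}
    return [(name, adv / scale[name]) for name, adv in samples]


# Uncertain group (mixed rewards) versus a settled group (all equal).
w_uncertain = variance_weight([1, 0, 1, 0])
w_settled = variance_weight([1, 1, 1, 1])

# Two hypothetical datasets whose raw advantages differ in scale by 20x.
samples = [("easy", 0.1), ("easy", -0.1), ("hard", 2.0), ("hard", -2.0)]
normalized = normalize_per_dataset(samples)
```

After normalization the "easy" and "hard" advantages have nearly the same magnitude, so neither dataset dominates the policy update purely because of its reward scale.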
In practice
- Define reward components for answer, info, format, route, and balance.
- Activate auxiliary rewards only above an answer correctness threshold.
- Use exponential moving average for routing frequency statistics.
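The last two items above can be sketched as follows. The threshold value, the strict comparison, the decay rate, and the names `gated_reward` and `RoutingFrequencyEMA` are all hypothetical; the summary names the techniques but not their parameters.

```python
def gated_reward(components, threshold=0.5):
    """Keep auxiliary reward components only when the answer reward clears the threshold.

    Whether the boundary is inclusive is an assumption; this sketch uses a
    strict comparison.
    """
    answer = components["answer"]
    if answer > threshold:
        return dict(components)
    return {"answer": answer}


class RoutingFrequencyEMA:
    """Track how often each model is chosen with an exponential moving average."""

    def __init__(self, model_names, decay=0.9):
        self.decay = decay
        # Start from a uniform estimate over the candidate models.
        self.freq = {m: 1.0 / len(model_names) for m in model_names}

    def update(self, chosen):
        for m in self.freq:
            hit = 1.0 if m == chosen else 0.0
            self.freq[m] = self.decay * self.freq[m] + (1.0 - self.decay) * hit
        return self.freq


# A wrong answer loses its auxiliary rewards; a correct one keeps them.
wrong = gated_reward({"answer": 0.0, "format": 1.0, "route": 1.0})
right = gated_reward({"answer": 1.0, "format": 1.0, "route": 1.0})

# One routing decision with two hypothetical models and a fast decay.
tracker = RoutingFrequencyEMA(["small", "large"], decay=0.5)
freq = tracker.update("small")
```

The running frequencies could then feed a balance reward that discourages routing every query to the same model, though how ReCal computes that component is not stated here.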
Topics
- LLM Routing
- Reinforcement Learning
- Reward Calibration
- Policy Optimization
- Advantage Estimation
- Multi-objective RL
- PPO
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.