ReCal: Reward Calibration for RL-based LLM Routing

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Large Language Models · Depth: Expert, long

Summary

ReCal is a reward calibration framework for reinforcement learning (RL)-based large language model (LLM) routing. Existing methods struggle with two problems: ambiguous credit assignment caused by scalarized rewards, and optimization bias caused by heterogeneous task difficulty and reward variability. ReCal addresses them with a two-stage calibration process. First, a hierarchical reward decomposition mechanism with component-wise advantage estimation provides clearer learning signals. Second, a distribution-aware optimization strategy calibrates variability through variance-aware reweighting and per-dataset normalization. In experiments across seven datasets, ReCal consistently improves routing performance and training stability over baselines.

Key takeaway

For machine learning engineers optimizing LLM routing policies, ReCal targets two common failure modes: training instability and optimization bias. Its hierarchical reward decomposition gives clearer credit assignment, and its distribution-aware optimization gives more stable policy updates across diverse tasks and varying query difficulty. Consider implementing component-wise advantage estimation and variance-aware reweighting to improve your routing system's performance and reliability.

Key insights

ReCal calibrates RL-based LLM routing rewards by disentangling objectives and normalizing distributions for clearer, more stable policy learning.

Principles

Decompose scalar rewards into objective-specific components.
Calibrate optimization signals across heterogeneous data distributions.
Prioritize uncertain routing cases with higher optimization weights.
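
The third principle can be illustrated with a minimal sketch. It assumes that "uncertain" means high reward variance across the routes sampled for one query, and that weights are normalized to a mean of 1. Both choices, along with the function name variance_aware_weights and the eps parameter, are illustrative assumptions and not the paper's formulation.

```python
from statistics import pvariance


def variance_aware_weights(group_rewards, eps=1e-8):
    """Weight each query by the variance of its sampled rewards.

    group_rewards: dict mapping a query id to the rewards obtained from
    several sampled routing decisions for that query.
    Weights are divided by the mean variance so they average to about 1.
    (Assumed weighting scheme, for illustration only.)
    """
    variances = {q: pvariance(r) for q, r in group_rewards.items()}
    mean_var = sum(variances.values()) / len(variances)
    return {q: v / (mean_var + eps) for q, v in variances.items()}


weights = variance_aware_weights({
    "easy": [1.0, 1.0, 1.0, 1.0],       # every route succeeds: no uncertainty
    "uncertain": [1.0, 0.0, 1.0, 0.0],  # routes disagree: high uncertainty
})
```

Here the query whose routes disagree receives the larger weight, while the query every route solves receives none. A real system would likely add a floor so that low-variance queries still contribute some gradient.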

Method

ReCal first applies hierarchical reward decomposition so that advantages can be estimated per component. It then applies variance-aware reweighting to emphasize uncertain instances and per-dataset normalization to align optimization scales. Both stages are integrated into a GRPO-style objective.
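
A rough sketch of how component-wise advantages and per-dataset normalization might fit together is given here. It assumes group-relative standardization in the GRPO style (reward minus group mean, divided by group standard deviation), applied to each component separately before mixing. The mixing coefficients in coeffs, the function names, and the choice to rescale each dataset by its own standard deviation are hypothetical; the summary does not give the exact formulas.

```python
from statistics import mean, pstdev


def group_advantages(values, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def component_advantages(rollouts, coeffs):
    """Standardize each reward component separately, then combine.

    rollouts: list of dicts (component name -> reward) for one query's group.
    coeffs: hypothetical mixing coefficient for each component.
    """
    total = [0.0] * len(rollouts)
    for name, c in coeffs.items():
        adv = group_advantages([r[name] for r in rollouts])
        total = [t + c * a for t, a in zip(total, adv)]
    return total


def normalize_per_dataset(advantages, dataset_ids, eps=1e-8):
    """Rescale advantages so every dataset has a comparable spread."""
    by_dataset = {}
    for a, d in zip(advantages, dataset_ids):
        by_dataset.setdefault(d, []).append(a)
    scale = {}
    for d, vals in by_dataset.items():
        s = pstdev(vals)
        scale[d] = s if s > eps else 1.0  # leave degenerate datasets unscaled
    return [a / scale[d] for a, d in zip(advantages, dataset_ids)]


rollouts = [{"answer": 1.0, "route": 0.2}, {"answer": 0.0, "route": 0.8}]
adv = component_advantages(rollouts, {"answer": 1.0, "route": 0.5})
aligned = normalize_per_dataset([2.0, -2.0, 0.1, -0.1], ["a", "a", "b", "b"])
```

Standardizing each component before mixing keeps a high-variance component from drowning out the others, which is one plausible reading of "clearer learning signals". The resulting advantages would then be plugged into the policy-gradient term of a GRPO-style loss.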

In practice

Define reward components for answer, info, format, route, and balance.
Activate auxiliary rewards only above an answer correctness threshold.
Use exponential moving average for routing frequency statistics.
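
The three practices above can be sketched together as follows. The component names come from the list; everything else is an assumption made for illustration: the RoutingFrequencyEMA class, the decay of 0.9, the threshold of 0.5, the hard 0/1 gate, and a balance reward defined as one minus the chosen model's smoothed routing frequency.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingFrequencyEMA:
    """Tracks how often each model is chosen, as an exponential moving average."""
    models: list
    decay: float = 0.9  # hypothetical smoothing factor
    freq: dict = field(default_factory=dict)

    def __post_init__(self):
        # Start from a uniform routing distribution.
        uniform = 1.0 / len(self.models)
        self.freq = {m: uniform for m in self.models}

    def update(self, chosen):
        # Move each frequency toward 1 for the chosen model, toward 0 otherwise.
        for m in self.models:
            hit = 1.0 if m == chosen else 0.0
            self.freq[m] = self.decay * self.freq[m] + (1.0 - self.decay) * hit


def reward_components(answer, info, fmt, route, chosen, tracker, threshold=0.5):
    """Return the five components; auxiliaries count only if answer > threshold."""
    gate = 1.0 if answer > threshold else 0.0
    balance = 1.0 - tracker.freq[chosen]  # assumed: rarer routes score higher
    return {
        "answer": answer,
        "info": gate * info,
        "format": gate * fmt,
        "route": gate * route,
        "balance": gate * balance,
    }


tracker = RoutingFrequencyEMA(["small", "large"])
tracker.update("small")
correct = reward_components(1.0, 0.5, 1.0, 0.8, "large", tracker)
wrong = reward_components(0.0, 0.5, 1.0, 0.8, "large", tracker)
```

Gating the auxiliary terms means a well-formatted but wrong answer earns nothing beyond its answer score, so the policy cannot trade correctness for easy auxiliary reward. The moving average keeps the balance signal from reacting to a single batch.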

Topics

LLM Routing
Reinforcement Learning
Reward Calibration
Policy Optimization
Advantage Estimation
Multi-objective RL
PPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.