Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
Summary
The Rubric-Conditioned Self-Distillation framework addresses limitations in how reasoning language models are post-trained. Current approaches typically rely either on expensive, noisy chain-of-thought annotations for supervised distillation or on scalar feedback in reinforcement learning. This method instead uses rubrics as structured, fine-grained feedback for on-policy self-distillation. A teacher model is conditioned on criterion-level rubrics and provides token-level guidance on trajectories sampled by the student model, rather than depending on a single reference rationale. Because a rubric specifies what constitutes a strong response, it enables more precise credit assignment during the reasoning process than scalar reward optimization. The framework is implemented as a two-stage pipeline that first generates task-specific rubrics and then trains a rubric-guided reasoner. In evaluations on diverse science reasoning benchmarks, it surpasses GRPO by 1.0 points and OPSD by 0.9 points on average.
Key takeaway
If you are a machine learning engineer developing reasoning language models and are held back by noisy chain-of-thought (CoT) annotations or scalar rewards, consider Rubric-Conditioned Self-Distillation. It provides fine-grained, token-level guidance through structured rubrics, which may improve model performance. The method uses a two-stage pipeline: generate task-specific rubrics, then train a rubric-guided reasoner. In the reported evaluations it outperformed both GRPO and OPSD on science reasoning benchmarks.
Key insights
Rubric-Conditioned Self-Distillation uses structured rubrics to give fine-grained, token-level guidance during language model training, and in the reported evaluations it outperforms GRPO, which optimizes a scalar reward.
Principles
- Rubrics provide fine-grained, structured feedback for reasoning processes.
- Token-level guidance from rubrics enables more precise credit assignment than a scalar reward.
- Self-distillation can use rubrics instead of depending on a single reference rationale.
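The contrast between token-level guidance and a scalar reward can be illustrated with a toy sketch. It assumes the token-level signal is a per-position KL divergence between the rubric-conditioned teacher's next-token distribution and the student's; the source does not state the exact objective or divergence direction, so both are assumptions, and all function names and numbers here are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(student_logits, teacher_logits):
    """KL(teacher || student) at one token position (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def per_token_signal(student_steps, teacher_steps):
    """One divergence value per position of the student's sampled trajectory."""
    return [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]

def scalar_signal(reward, num_tokens):
    """A sequence-level reward gives every token the same credit."""
    return [reward] * num_tokens

# Toy trajectory of 3 tokens over a 3-word vocabulary (made-up logits).
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]
# The teacher disagrees with the student only at the last token.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

token_level = per_token_signal(student, teacher)
sequence_level = scalar_signal(reward=0.0, num_tokens=len(student))
```

Here `token_level` is zero at the first two positions and positive only at the third, so the training signal points at the step where the teacher and student diverge. `sequence_level` assigns the same value to every position and cannot localize the error.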
Method
A two-stage pipeline first generates task-specific rubrics, then trains a reasoner using rubric-conditioned teacher guidance on student trajectories.
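The data flow of the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Criterion` and `Rubric` types, the `propose` callable standing in for a rubric-generating model, and the prompt layout are all hypothetical. It also assumes that only the teacher sees the rubric while the student sees the bare task, which the summary implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    name: str
    description: str

@dataclass
class Rubric:
    task: str
    criteria: List[Criterion]

def generate_rubric(task: str, propose: Callable[[str], List[Criterion]]) -> Rubric:
    """Stage 1: build a task-specific rubric. `propose` stands in for a model call."""
    return Rubric(task=task, criteria=propose(task))

def build_teacher_context(rubric: Rubric) -> str:
    """Stage 2 input for the teacher: the task plus numbered rubric criteria."""
    lines = [f"Task: {rubric.task}", "Rubric:"]
    lines += [f"{i}. {c.name}: {c.description}"
              for i, c in enumerate(rubric.criteria, start=1)]
    return "\n".join(lines)

def build_student_context(rubric: Rubric) -> str:
    """Stage 2 input for the student: the task only (an assumption)."""
    return f"Task: {rubric.task}"

def toy_propose(task: str) -> List[Criterion]:
    """Hard-coded criteria used in place of a real rubric generator."""
    return [
        Criterion("Units", "Tracks physical units through every step."),
        Criterion("Justification", "States the law used before applying it."),
    ]

rubric = generate_rubric(
    "Find the acceleration of a 2 kg mass under a 10 N force.", toy_propose
)
teacher_context = build_teacher_context(rubric)
student_context = build_student_context(rubric)
```

In a real training loop, the student would sample a trajectory from `student_context`, and the same model conditioned on `teacher_context` would score that trajectory token by token to produce the distillation targets.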
In practice
- Integrate criterion-level rubrics for detailed feedback in LM post-training.
- Develop a system to generate task-specific rubrics for model guidance.
Topics
- Rubric-Conditioned Self-Distillation
- Language Model Reasoning
- Reinforcement Learning
- Self-Distillation
- Reward Supervision
- Fine-grained Feedback
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.