RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation
Summary
RLCSD (Reinforcement Learning with Contrastive On-Policy Self-Distillation) is a method for countering "privilege-induced style drift" in on-policy self-distillation (OPSD). OPSD provides dense, token-level supervision by aligning a model's distribution with the distribution it produces under a privileged context, such as a verified solution. However, the resulting learning signal often concentrates on stylistic tokens rather than task-bearing ones, which leads to shorter outputs and training instability. RLCSD addresses this by contrasting the teacher-student gap under a correct hint against the gap under a wrong hint. Style shifts that any hint induces are suppressed, and the signal concentrates on task-relevant tokens. In experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think, RLCSD consistently outperforms GRPO and prior OPSD methods on mathematical and logical reasoning tasks. The contrastive principle is also shown to be general: it improves existing OPSD methods and extends to cross-model on-policy distillation.
Key takeaway
Machine learning engineers who train reasoning models with on-policy self-distillation should consider RLCSD. Its contrastive approach directly targets "privilege-induced style drift", so the model learns task-relevant information rather than stylistic changes induced by hints. Adopting RLCSD can stabilize training and improve performance on tasks such as mathematical and logical reasoning, as demonstrated on Qwen3 and Olmo-3-7B-Think models.
Key insights
RLCSD employs contrastive learning to mitigate "privilege-induced style drift" in on-policy self-distillation, focusing the signal on task-bearing tokens.
Principles
- Hinting can cause "privilege-induced style drift."
- Contrast the teacher-student gap under correct and wrong hints to focus the signal.
- Contrastive learning improves on-policy distillation.
Method
RLCSD contrasts the teacher-student gap under a correct hint against the gap under a wrong hint. This suppresses hint-induced style shifts and focuses the learning signal on the task-bearing tokens of a reasoning model's output.
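The contrast described above can be illustrated with a small sketch. This summary does not give RLCSD's exact objective, so the code below rests on an assumption: the "gap" is taken to be the per-token difference in log-probability between the hinted teacher and the unhinted student on the sampled tokens. The function names and all numbers are hypothetical and chosen only to show the idea.

```python
def token_gaps(teacher_lp, student_lp):
    """Per-token teacher-student gap on the sampled tokens.

    Assumption: gap = log p_teacher(token) - log p_student(token).
    """
    return [t - s for t, s in zip(teacher_lp, student_lp)]


def contrastive_signal(student_lp, teacher_correct_lp, teacher_wrong_lp):
    """Gap under the correct hint minus gap under the wrong hint."""
    gap_correct = token_gaps(teacher_correct_lp, student_lp)
    gap_wrong = token_gaps(teacher_wrong_lp, student_lp)
    return [gc - gw for gc, gw in zip(gap_correct, gap_wrong)]


# Toy rollout with made-up log-probabilities (not from the paper).
tokens = ["So", "x", "=", "4"]
student_lp = [-1.0, -0.5, -0.25, -2.0]          # no hint in context
teacher_correct_lp = [-0.5, -0.5, -0.25, -0.5]  # correct hint in context
teacher_wrong_lp = [-0.5, -0.5, -0.25, -3.0]    # wrong hint in context

plain_gap = token_gaps(teacher_correct_lp, student_lp)
signal = contrastive_signal(student_lp, teacher_correct_lp, teacher_wrong_lp)

for tok, g, s in zip(tokens, plain_gap, signal):
    print(f"{tok!r}: plain gap {g:+.2f}, contrastive signal {s:+.2f}")
```

In this toy case the stylistic token "So" shifts by the same amount under either hint, so its plain gap is nonzero but its contrastive signal is zero. The answer token "4" moves in opposite directions under the two hints, so it keeps a large signal.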
In practice
- Integrate the contrastive principle into existing OPSD methods.
- Extend the contrastive insight to cross-model on-policy distillation.
- Apply RLCSD to mathematical and logical reasoning.
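As a rough illustration of the first point in the list above, the contrastive principle can be written as a wrapper around any per-token OPSD signal. This is a sketch under stated assumptions, not the authors' implementation: `make_contrastive` and `clipped_gap` are hypothetical names, and the clipped log-probability gap merely stands in for whatever signal an existing OPSD method computes.

```python
from typing import Callable, List

# Hypothetical interface: an OPSD signal maps (student log-probs,
# hinted-teacher log-probs) to one value per sampled token.
SignalFn = Callable[[List[float], List[float]], List[float]]


def make_contrastive(signal_fn: SignalFn):
    """Wrap a per-token OPSD signal so it is contrasted across hints."""

    def contrastive(student_lp, teacher_correct_lp, teacher_wrong_lp):
        pos = signal_fn(student_lp, teacher_correct_lp)  # correct hint
        neg = signal_fn(student_lp, teacher_wrong_lp)    # wrong hint
        return [p - n for p, n in zip(pos, neg)]

    return contrastive


def clipped_gap(student_lp, teacher_lp, limit=1.0):
    """Stand-in OPSD signal: log-prob gap clipped to [-limit, limit]."""
    return [max(-limit, min(limit, t - s))
            for s, t in zip(student_lp, teacher_lp)]


contrastive_clipped = make_contrastive(clipped_gap)

# Made-up log-probs: the first token shifts identically under both hints,
# the second shifts in opposite directions.
result = contrastive_clipped([-1.0, -2.0], [-0.5, -0.5], [-0.5, -3.0])
print(result)
```

The wrapper leaves the underlying signal untouched and only adds a second teacher pass with a wrong hint, which is the extra cost this design would incur.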
Topics
- RLCSD
- Contrastive Learning
- On-Policy Self-Distillation
- Reasoning Models
- Mathematical Reasoning
- Logical Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.