RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

RLCSD (Reinforcement Learning with Contrastive On-Policy Self-Distillation) is a method for overcoming "privilege-induced style drift" in on-policy self-distillation (OPSD). OPSD provides dense, token-level supervision by aligning the model's distribution with its own distribution under a privileged context, such as a verified solution. In practice, much of that learning signal lands on stylistic tokens rather than task-bearing ones, which leads to shorter outputs and unstable training. RLCSD addresses this by contrasting the teacher-student gap under a correct hint against the gap under a wrong hint. Style shifts that any hint induces are suppressed, and the signal concentrates on task-relevant tokens. In experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think, RLCSD consistently outperforms GRPO and prior OPSD methods on mathematical and logical reasoning tasks. The contrastive principle is also shown to be general: it improves existing OPSD methods and extends to cross-model on-policy distillation.

Key takeaway

Machine learning engineers who train reasoning models with on-policy self-distillation should consider RLCSD. Its contrastive approach directly targets "privilege-induced style drift", so the model learns task-relevant information from hints rather than the stylistic changes they induce. Adopting RLCSD can stabilize training and improve performance on mathematical and logical reasoning tasks, as demonstrated on Qwen3 and Olmo-3-7B-Think models.

Key insights

RLCSD contrasts correct-hint and wrong-hint teacher signals to mitigate "privilege-induced style drift" in on-policy self-distillation, focusing the learning signal on task-bearing tokens.

Principles

Method

RLCSD contrasts the teacher-student gap under a correct hint against the gap under a wrong hint. Style shifts that both hints induce cancel in the contrast, which focuses the learning signal on the task-bearing tokens that matter for reasoning models.
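
The contrast can be illustrated with a small sketch. The summary does not state the exact objective, so everything here is an assumption made for illustration: the function name contrastive_token_signal is hypothetical, the "gap" is taken to be a per-token log-probability difference between the hinted teacher and the unhinted student, and the contrast is taken to be a plain subtraction of the two gaps. It should not be read as the paper's loss.

```python
from typing import List


def contrastive_token_signal(
    student_logp: List[float],
    teacher_correct_logp: List[float],
    teacher_wrong_logp: List[float],
) -> List[float]:
    """Illustrative per-token contrast (an assumption, not the paper's objective).

    Each list holds log-probabilities of the same sampled tokens:
    - student_logp: the model without any hint
    - teacher_correct_logp: the same model conditioned on a correct hint
    - teacher_wrong_logp: the same model conditioned on a wrong hint
    """
    if not (len(student_logp) == len(teacher_correct_logp) == len(teacher_wrong_logp)):
        raise ValueError("all log-probability lists must have the same length")

    signal = []
    for s, tc, tw in zip(student_logp, teacher_correct_logp, teacher_wrong_logp):
        gap_correct = tc - s  # teacher-student gap under the correct hint
        gap_wrong = tw - s    # teacher-student gap under the wrong hint
        # A shift that both hints cause equally (style) cancels here.
        signal.append(gap_correct - gap_wrong)
    return signal


# Token 0: both hints raise the log-prob equally (a purely stylistic shift).
# Token 1: only the correct hint raises it (a task-bearing token).
print(contrastive_token_signal([-2.0, -2.0], [-1.0, -0.5], [-1.0, -3.0]))
```

Under these assumptions the stylistic token receives zero signal while the task-bearing token keeps a positive one. In this simple subtractive form the student term cancels, leaving only the difference between the two hinted teachers; a real implementation may weight or clip the terms differently.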

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.