Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Rubric-Conditioned Self-Distillation (RCSD) is a post-training framework for reasoning language models that integrates structured, fine-grained rubrics into on-policy self-distillation. Traditional distillation often depends on costly, noisy chain-of-thought annotations, and reinforcement learning compresses evaluative feedback into a sparse scalar signal. RCSD instead conditions a teacher model on criterion-level rubrics, so the teacher can give dense, token-level guidance on the student's own sampled trajectories without relying on a single reference rationale. The framework uses a two-stage pipeline: first, a rubric generator learns to produce task-specific evaluation criteria; second, a reasoner is trained with these rubrics as structured guidance. On diverse science reasoning benchmarks with a Qwen3-8B backbone, RCSD achieved an average score of 70.6, outperforming Group Relative Policy Optimization (GRPO) by 1.0 to 1.4 points and On-Policy Self-Distillation (OPSD) by 0.9 points. It also generalized competitively to medical question answering tasks.

Key takeaway

Machine learning engineers developing reasoning models for open-ended or hard-to-verify tasks should consider Rubric-Conditioned Self-Distillation (RCSD). The method moves beyond sparse scalar rewards and single-reference distillation by providing dense, criterion-aware, token-level guidance. Its reported results on scientific and rubric-based benchmarks suggest that RCSD can yield more stable, efficient, and internally consistent reasoning trajectories. Because the structured feedback preserves distinctions across evaluation dimensions, it can also reduce repetitive or contradictory outputs.

Key insights

Rubrics let the teacher provide fine-grained, token-level guidance during LLM self-distillation, outperforming scalar rewards and single-reference rationales.
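
To make the contrast between a scalar reward and token-level guidance concrete, here is a minimal sketch in plain Python. It is not the paper's objective: the summary does not state which divergence RCSD uses, so the per-token KL(teacher || student), its direction, the function names, and the toy logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(student_logits, teacher_logits):
    """Return one KL(teacher || student) value per trajectory position.

    Assumption: the divergence and its direction are illustrative only;
    the source summary does not specify the loss used by RCSD.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_t = softmax(t_row)
        p_s = softmax(s_row)
        kl = sum(
            pt * (math.log(pt) - math.log(ps))
            for pt, ps in zip(p_t, p_s)
            if pt > 0.0
        )
        losses.append(kl)
    return losses


# Toy trajectory: 3 sampled tokens over a 4-word vocabulary (made-up numbers).
# The hypothetical rubric-conditioned teacher disagrees with the student
# only at the second position.
student = [[2.0, 0.5, 0.1, 0.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
teacher = [[2.0, 0.5, 0.1, 0.0], [3.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

per_token = token_level_kl(student, teacher)  # dense: one value per token
scalar_reward = 1.0  # sparse: a single number for the whole trajectory
```

In this toy case the per-token values are zero wherever teacher and student agree and positive at the one position where they differ, so the signal localizes the disagreement. A single trajectory-level reward cannot do that.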

Method

A two-stage pipeline trains a rubric generator from privileged data, then trains a reasoner using on-policy rubric-conditioned distillation, where the teacher is guided by the generated rubrics.
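
The data flow of the second stage can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `format_teacher_context`, `distillation_signal`, the prompt template, and the toy stand-in callables are hypothetical. It also assumes that teacher and student share one scoring function (as the term "self-distillation" suggests) and differ only in whether the rubric appears in the context.

```python
from typing import Callable, List, Tuple


def format_teacher_context(question: str, rubric: List[str]) -> str:
    """Build the teacher's context: the question plus criterion-level rubric.

    The template is an assumption; the source does not give a prompt format.
    """
    criteria = "\n".join(f"- {c}" for c in rubric)
    return f"{question}\n\nEvaluation criteria:\n{criteria}"


def distillation_signal(
    question: str,
    rubric_generator: Callable[[str], List[str]],
    sample_student: Callable[[str], List[str]],
    token_logprobs: Callable[[str, List[str]], List[float]],
) -> List[Tuple[str, float]]:
    """Score the student's own trajectory with and without the rubric.

    Returns (token, teacher_logprob - student_logprob) pairs, i.e. a dense
    per-token signal computed on an on-policy sample.
    """
    rubric = rubric_generator(question)      # stage-1 output
    trajectory = sample_student(question)    # sampled without the rubric
    student_lp = token_logprobs(question, trajectory)
    teacher_lp = token_logprobs(format_teacher_context(question, rubric), trajectory)
    return [(tok, t - s) for tok, t, s in zip(trajectory, teacher_lp, student_lp)]


# Toy stand-ins (all hypothetical) so the sketch runs without a model.
def toy_rubric_generator(question: str) -> List[str]:
    return ["states the governing law", "uses consistent units"]


def toy_sample_student(question: str) -> List[str]:
    return ["F", "=", "ma"]


def toy_token_logprobs(context: str, tokens: List[str]) -> List[float]:
    # Pretend the rubric makes the final token more likely.
    boosted = "Evaluation criteria" in context
    return [-0.5 if (boosted and tok == "ma") else -1.0 for tok in tokens]


signal = distillation_signal(
    "What force accelerates a mass?",
    toy_rubric_generator,
    toy_sample_student,
    toy_token_logprobs,
)
```

Passing the rubric generator, sampler, and scorer as callables keeps the sketch independent of any specific model library. In a real setup, each would wrap the trained rubric generator and the reasoner.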

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.