Constitutional On-Policy Safe Distillation

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Constitutional On-Policy Safe Distillation (COPSD) is a method that addresses the collapse observed when on-policy self-distillation (OPSD) is applied to safety alignment. Prior OPSD approaches guided by high-level constitutions often contract the teacher distribution, which yields short, overly conservative responses; Reverse KL makes the contraction worse. The authors formalize this as geometric leakage in a non-orthogonal semantic space, where safety pressure spills over into expressiveness. COPSD mitigates the problem in two stages: it first calibrates the teacher model with a Cross-SFT cold-start, then applies constitution-conditioned on-policy distillation. Evaluated across 12 benchmarks, COPSD shows a better safety-helpfulness trade-off than existing baselines and a smaller performance penalty on general reasoning.
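
The leakage idea can be illustrated with a toy calculation. This is only a sketch of the general geometric intuition, not the paper's formalization: the 2-D vectors, the names `safety_dir` and `expr_dir`, and the `leakage` helper are all hypothetical. When the two directions are orthogonal, a step along the safety direction has no component along the expressiveness direction; when they overlap, part of the step lands on it.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def leakage(safety_dir, expr_dir, step):
    """Component of a step along safety_dir that falls on expr_dir (hypothetical helper)."""
    return step * dot(normalize(safety_dir), normalize(expr_dir))

# Assumed toy directions: one orthogonal pair, one pair 60 degrees apart.
orthogonal = leakage([1.0, 0.0], [0.0, 1.0], step=1.0)
overlapping = leakage(
    [1.0, 0.0],
    [math.cos(math.pi / 3), math.sin(math.pi / 3)],
    step=1.0,
)
print(orthogonal, overlapping)
```

In the overlapping case the leaked component equals the cosine of the angle between the two directions, so the closer the directions, the more safety pressure moves expressiveness as well.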

Key takeaway

For machine learning engineers developing safe large language models, Constitutional On-Policy Safe Distillation (COPSD) offers a way to ease the trade-off between safety and helpfulness. If your models become overly conservative or lose expressiveness during safety alignment, consider adopting COPSD's two steps: teacher calibration followed by constitution-conditioned distillation. The approach reduces the "safety tax" on general reasoning ability, yielding safe models that remain balanced and capable.

Key insights

Constitutional On-Policy Safe Distillation (COPSD) prevents safety alignment collapse in OPSD by calibrating the teacher and refining distillation.

Principles

Safety pressure can geometrically leak into expressiveness.
Constitutional conditioning contracts teacher distributions.
Reverse KL amplifies distribution contraction.
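
The third principle rests on a well-known property of Reverse KL: it is mode-seeking, so it tolerates a student that concentrates on part of the teacher's mass. A minimal sketch on a toy four-token distribution follows; the probabilities and the names `teacher`, `collapsed`, and `broad` are made up for illustration and do not come from the paper.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists of probabilities."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Assumed toy teacher with two modes, plus two candidate students.
teacher = [0.49, 0.01, 0.01, 0.49]
collapsed = [0.97, 0.01, 0.01, 0.01]   # concentrates on one teacher mode
broad = [0.25, 0.25, 0.25, 0.25]       # spreads mass over everything

# Reverse KL is KL(student || teacher); forward KL is KL(teacher || student).
reverse_collapsed = kl(collapsed, teacher)
reverse_broad = kl(broad, teacher)
forward_collapsed = kl(teacher, collapsed)
forward_broad = kl(teacher, broad)

print("reverse KL prefers collapsed:", reverse_collapsed < reverse_broad)
print("forward KL prefers broad:", forward_broad < forward_collapsed)
```

Under Reverse KL the collapsed student scores better than the broad one, while forward KL ranks them the other way. If the teacher has already been contracted by constitutional conditioning, a mode-seeking objective can narrow the student further.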

Method

COPSD calibrates the teacher via a Cross-SFT cold-start, then applies constitution-conditioned on-policy distillation to improve the safety-helpfulness trade-off.
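
The distillation stage can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: `toy_logits` is a hypothetical stand-in for a language model over a three-token vocabulary, the teacher is assumed to be the same model conditioned on the constitution, and the loss is assumed to be a per-token Reverse KL averaged over tokens sampled from the student. The Cross-SFT cold-start is omitted because this summary does not describe its procedure.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    """KL(student || teacher) for discrete distributions."""
    return sum(q * math.log(q / p) for q, p in zip(student, teacher) if q > 0)

def toy_logits(context, constitution=None):
    """Hypothetical model stand-in: returns logits over a 3-token vocabulary."""
    # Assumed effect of the constitution: it pushes mass away from token 2.
    base = [1.0, 0.5, 0.0] if constitution is None else [1.0, 0.5, -2.0]
    base[0] += 0.1 * len(context)  # small context dependence
    return base

def on_policy_distill_loss(prompt, constitution, steps, rng):
    context = list(prompt)
    total = 0.0
    for _ in range(steps):
        student = softmax(toy_logits(context))                # sees no constitution
        teacher = softmax(toy_logits(context, constitution))  # same model, conditioned
        total += reverse_kl(student, teacher)
        # On-policy: the next token is sampled from the student, not the teacher.
        token = rng.choices(range(len(student)), weights=student)[0]
        context.append(token)
    return total / steps

loss = on_policy_distill_loss(
    prompt=[0], constitution="be safe", steps=5, rng=random.Random(0)
)
print(loss)
```

In a real setup the loss would be backpropagated through the student only, with the teacher's distribution treated as a fixed target at each step.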

Topics

Constitutional AI
On-Policy Self-Distillation
Safety Alignment
Large Language Models
Model Distillation
Cross-SFT

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.