Constitutional On-Policy Safe Distillation
Summary
Constitutional On-Policy Safe Distillation (COPSD) is a method for addressing the severe collapse observed when on-policy self-distillation (OPSD) is applied to safety alignment. In prior OPSD approaches, conditioning the teacher on a high-level constitution often contracts the teacher distribution, producing overly conservative, short responses; the reverse KL objective exacerbates the problem. The authors formalize this as geometric leakage in a non-orthogonal semantic space, where safety pressure spills over into expressiveness and degrades it. COPSD mitigates this by first calibrating the teacher with a Cross-SFT cold-start and then running constitution-conditioned on-policy distillation. Evaluated on 12 benchmarks, COPSD shows a better safety-helpfulness trade-off than existing baselines and significantly reduces the penalty on general reasoning ability.
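The geometric leakage idea can be pictured with a toy projection. The sketch below is only an illustration under our own assumptions, not the paper's formalization: `safety_dir` and `expressive_dir` are hypothetical 2-D directions, and `leakage` is a made-up helper that measures how much of a step along the safety direction lands on the expressiveness direction.

```python
import math

def leakage(safety_dir, expressive_dir, step):
    """Signed component of a step along safety_dir that falls on expressive_dir.

    Toy model (an assumption for illustration): leakage = step * cos(angle).
    """
    dot = sum(s * e for s, e in zip(safety_dir, expressive_dir))
    norm_s = math.sqrt(sum(s * s for s in safety_dir))
    norm_e = math.sqrt(sum(e * e for e in expressive_dir))
    return step * dot / (norm_s * norm_e)

# Orthogonal directions: a safety update leaves expressiveness untouched.
orthogonal = leakage([1.0, 0.0], [0.0, 1.0], step=1.0)

# Non-orthogonal directions with negative overlap: the same safety update
# also pushes against expressiveness (about -0.6 here).
overlapping = leakage([1.0, 0.0], [-0.6, 0.8], step=1.0)

print(orthogonal, overlapping)
```

In this picture, the more the two directions overlap, the larger the share of every safety update that is paid for in expressiveness.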
Key takeaway
For machine learning engineers building safe large language models, COPSD offers a way to soften the usual trade-off between safety and helpfulness. If your models become overly conservative or lose expressiveness during safety alignment, consider COPSD's two steps: teacher calibration followed by constitution-conditioned on-policy distillation. According to the reported results, this approach can significantly reduce the "safety tax" on general reasoning ability, yielding safe models that stay more balanced and capable.
Key insights
COPSD prevents safety-alignment collapse in OPSD by calibrating the teacher before running constitution-conditioned on-policy distillation.
Principles
- In a non-orthogonal semantic space, safety pressure can leak geometrically into expressiveness.
- Constitutional conditioning contracts teacher distributions.
- Reverse KL amplifies distribution contraction.
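The last two principles can be made concrete with a small numeric sketch. The distributions below are invented toy numbers, not results from the paper: `teacher` stands for a contracted, constitution-conditioned teacher that puts most mass on one conservative token, and the two students are hypothetical candidates. Reverse KL is KL(student || teacher); forward KL is KL(teacher || student).

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions with strictly positive b."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Toy next-token distributions over three tokens (illustrative assumptions).
teacher = [0.90, 0.09, 0.01]    # contracted teacher
collapsed = [0.98, 0.01, 0.01]  # student even narrower than the teacher
diverse = [0.70, 0.15, 0.15]    # student that keeps more varied outputs

rev_collapsed = kl(collapsed, teacher)  # reverse KL: student first
rev_diverse = kl(diverse, teacher)
fwd_collapsed = kl(teacher, collapsed)  # forward KL: teacher first
fwd_diverse = kl(teacher, diverse)

# How much more each objective penalizes the diverse student.
rev_gap = rev_diverse / rev_collapsed
fwd_gap = fwd_diverse / fwd_collapsed

print(f"reverse KL  collapsed={rev_collapsed:.3f} diverse={rev_diverse:.3f}")
print(f"forward KL  collapsed={fwd_collapsed:.3f} diverse={fwd_diverse:.3f}")
```

With these numbers both objectives prefer the collapsed student, but reverse KL penalizes the diverse student far more heavily than forward KL does, because it charges the student for any mass placed where the teacher has little. That is the sense in which a contracted teacher combined with reverse KL pulls the student toward narrow, conservative outputs.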
Method
COPSD first calibrates the teacher via a Cross-SFT cold-start, then applies constitution-conditioned on-policy distillation to improve the safety-helpfulness trade-off.
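The distillation stage can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the authors' implementation: the student samples its own response (on-policy), the teacher scores the same tokens with the constitution in its context, and the loss is the mean per-token reverse KL. The names `on_policy_distill_loss`, `toy_student`, and `toy_teacher` are hypothetical, the "models" are stand-in functions returning fixed logits, and the Cross-SFT calibration step is omitted because this summary does not describe its details.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs) if q > 0)

def on_policy_distill_loss(student_fn, teacher_fn, prompt, constitution,
                           max_tokens, rng):
    """Sample a response from the student and average the per-token
    reverse KL against the constitution-conditioned teacher."""
    response, losses = [], []
    for _ in range(max_tokens):
        q = softmax(student_fn(prompt, response))
        # The teacher sees the constitution; the student does not.
        p = softmax(teacher_fn(constitution + prompt, response))
        losses.append(reverse_kl(q, p))
        # On-policy: the next token is drawn from the student itself.
        token = rng.choices(range(len(q)), weights=q)[0]
        response.append(token)
    return sum(losses) / len(losses), response

def toy_student(context, response):
    return [1.0, 0.5, 0.0]

def toy_teacher(context, response):
    # Hypothetical effect: the constitution shifts mass toward token 0.
    bias = 1.0 if "CONSTITUTION" in context else 0.0
    return [1.0 + bias, 0.5, 0.0]

rng = random.Random(0)
loss, sampled = on_policy_distill_loss(
    toy_student, toy_teacher, "user prompt", "CONSTITUTION: be safe. ", 4, rng)
print(loss, sampled)
```

In a real training loop the loss would be backpropagated through the student only, with the teacher's distribution treated as a fixed target; the toy above just computes the quantity being minimized.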
Topics
- Constitutional AI
- On-Policy Self-Distillation
- Safety Alignment
- Large Language Models
- Model Distillation
- Cross-SFT
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.