Constitutional On-Policy Safe Distillation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Constitutional On-Policy Safe Distillation (COPSD) is a method for addressing the severe collapse observed when on-policy self-distillation (OPSD) is applied to safety alignment. Prior OPSD approaches guided by high-level constitutions often contract the teacher distribution, which yields overly conservative, short responses; the reverse KL objective exacerbates the problem. The work formalizes this phenomenon as geometric leakage within a non-orthogonal semantic space, where safety pressure spills over into expressiveness and degrades it. COPSD mitigates the collapse in two stages: it first calibrates the teacher model with a Cross-SFT cold-start, then runs constitution-conditioned on-policy distillation. Evaluated across 12 benchmarks, COPSD achieves a better safety-helpfulness trade-off than existing baselines and significantly reduces the performance penalty on general reasoning ability.
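To make the reverse KL point concrete, here is a minimal sketch, not the paper's implementation. It assumes the distillation signal is a per-token KL divergence between the student's and the teacher's next-token distributions; the function names and the toy probabilities are hypothetical. It shows that reverse KL, KL(student || teacher), penalizes a student that collapses onto the teacher's dominant token less than forward KL does, which is one way a contracted teacher can pull the student toward narrow outputs.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(student, teacher):
    # Expectation is taken under the student: tolerant of dropped teacher mass.
    return kl(student, teacher)

def forward_kl(student, teacher):
    # Expectation is taken under the teacher: penalizes dropped teacher mass.
    return kl(teacher, student)

def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over the positions of one student-sampled
    (on-policy) sequence. Each element is a next-token distribution."""
    losses = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(losses) / len(losses)

# Toy 4-token vocabulary; all numbers are made up for illustration.
teacher = [0.70, 0.20, 0.05, 0.05]          # contracted teacher
peaked_student = [0.97, 0.01, 0.01, 0.01]   # collapsed onto the top token

rkl = reverse_kl(peaked_student, teacher)
fkl = forward_kl(peaked_student, teacher)
print(f"reverse KL = {rkl:.3f}, forward KL = {fkl:.3f}")
```

In this toy case the collapsed student incurs roughly half the penalty under reverse KL that it does under forward KL, so the objective alone does little to push it back toward the teacher's lower-probability tokens.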

Key takeaway

For machine learning engineers developing safe large language models, Constitutional On-Policy Safe Distillation (COPSD) offers a way to ease the common trade-off between safety and helpfulness. If your models become overly conservative or lose expressiveness during safety alignment, consider adopting COPSD's two stages: teacher calibration via a Cross-SFT cold-start, followed by constitution-conditioned on-policy distillation. This approach can significantly reduce the "safety tax" on general reasoning ability, yielding safe models that remain more balanced and capable.

Key insights

Constitutional On-Policy Safe Distillation (COPSD) prevents safety alignment collapse in OPSD by calibrating the teacher before running constitution-conditioned on-policy distillation.

Principles

Method

COPSD calibrates the teacher via a Cross-SFT cold-start, then applies constitution-conditioned on-policy distillation to improve the safety-helpfulness trade-off.

Topics

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.