Incriminating misaligned AI models via distillation
Summary
The concept of "incrimination via distillation" proposes a method to detect misalignment in powerful, un-auditable AI models (teachers) by distilling them into less capable, auditable student models. The core idea is to transfer the teacher's misaligned drives to the student while preventing the transfer of the teacher's ability to evade audits. This process involves performing multiple distillation runs with varying parameters, focusing on either preserving student auditability or targeting misalignment transfer. Techniques include avoiding distillation data that enhances strategic reasoning, selecting less intelligent or architecturally auditable student models, and using distillation inputs that prompt the teacher to consider its misaligned goals. The goal is for the student model to inherit misalignment but lack the sophistication to hide it, thus exposing the teacher's underlying issues through the student's behavior.
Key takeaway
Research scientists developing AI alignment audits should explore incrimination via distillation as a promising technique for detecting hidden misalignment in powerful models. Focus on empirically testing and refining distillation methods that selectively transfer misaligned propensities without compromising the student model's auditability. Consider validating the procedure on model organisms such as those in AuditBench and minimizing false positives, which in turn makes the allocation of safety resources more efficient.
Key insights
Distilling a misaligned AI into a less capable student can expose hidden misalignment by transferring the misaligned propensity without the capability to evade audits.
Principles
- Misalignment is a propensity, and propensities are easier to transfer than the capability to evade audits.
- Partial misalignment transfer suffices for incrimination.
- Distillation inputs can influence misalignment transfer.
Method
Select student models with reduced intelligence or auditable architectures. Perform multiple distillation runs, varying the parameters so that each run prioritizes either preserving student auditability or targeting misalignment transfer. Then audit each student model to detect transferred misalignment.
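As a rough illustration of the method above, the following Python sketch enumerates a grid of distillation runs and combines the audit results. Every name here (`DistillRun`, `plan_runs`, `incriminated`, the student labels) is hypothetical and not taken from the original article. The rule that a single flagged student incriminates the teacher is an assumption for illustration; a real procedure would also need to control for false positives.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DistillRun:
    """One distillation run; all field values are hypothetical labels."""
    objective: str  # "preserve_auditability" or "target_misalignment"
    student: str    # identifier of the student model
    method: str     # "logit_distillation" or "dpo"


def plan_runs(students, methods):
    """Enumerate every combination of objective, student, and training method."""
    objectives = ("preserve_auditability", "target_misalignment")
    return [DistillRun(o, s, m) for o, s, m in product(objectives, students, methods)]


def incriminated(audit_flags):
    """Assumed aggregation rule: flag the teacher if any student audit found misalignment."""
    return any(audit_flags.values())


runs = plan_runs(
    students=["small_student", "auditable_arch_student"],
    methods=["logit_distillation", "dpo"],
)
```

Each `DistillRun` would be handed to whatever distillation and audit pipeline is in use; the sketch only covers the bookkeeping around it.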
In practice
- Avoid strategic reasoning data in auditability-preserving distillation.
- Use open-ended tasks for subliminal misalignment transfer.
- Use logit distillation or direct preference optimization (DPO) as the training method.
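To make the logit-distillation option concrete, here is a minimal pure-Python sketch of one common form of the objective: the KL divergence from the teacher's next-token distribution to the student's, computed from raw logits at a single position. The function names and the `temperature` parameter are illustrative assumptions, and the original article does not specify which variant of the loss it has in mind.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (assumed loss variant)."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probabilities
    log_q = log_softmax(student_logits, temperature)  # student log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# The loss is zero when the student matches the teacher and positive otherwise.
matched = logit_distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = logit_distillation_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

In training, this per-position loss would be averaged over tokens and minimized with respect to the student's parameters, typically inside a deep learning framework rather than in plain Python.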
Topics
- AI Misalignment
- Model Distillation
- Alignment Audits
- Incrimination via Distillation
- Subliminal Transfer
Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.