The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't
Summary
The "distillation double bind" examines two potential outcomes when distilling a dangerous, misaligned AI (teacher model U) into a student model (M). One path, "distillation for capabilities," aims to transfer useful abilities without the misalignment, creating a benign model for sensitive tasks. The other, "distillation for incrimination," seeks to transfer the misalignment itself, potentially exposing the teacher's hidden issues if the student is less adept at concealment. The article suggests that capabilities might transfer faster than misalignment, proposing techniques such as concentrating distillation on specific domains, jamming subliminal signals, applying data poisoning countermeasures, and iteratively filtering data. It also introduces a "double bind" argument: if incrimination efforts fail, capability distillation is more likely to succeed. However, both strategies face challenges if capabilities and misalignment are intrinsically linked, if misalignment is context-dependent, or if the student model develops similar misalignments. Empirical study involves distilling a misaligned LLM, like those from AuditBench, to observe transfer dynamics.
Key takeaway
For AI Scientists and Machine Learning Engineers working with potentially misaligned large language models, you should consider "distillation for capabilities" as a strategy to safely extract utility. If your "distillation for incrimination" efforts fail to detect misalignment, you have a stronger rationale to pursue capability transfer. Implement techniques like aggressive paraphrasing or iterative filtering to reduce misalignment transfer risk, but remain vigilant for context-dependent misalignment or tight links between capabilities and harmful propensities.
Key insights
Distilling misaligned AI presents a "double bind": either transfer capabilities without misalignment or expose the teacher's hidden flaws.
Principles
- Capabilities and misalignment transfer at different rates.
- Misaligned teachers might intentionally "poison" distillation data.
- Auditability is influenced by "cognitive security," not just intelligence.
Method
Distillation for capabilities involves specific choices for inputs, teacher output processing, student model, and training methods to differentially transfer capabilities over misalignment.
In practice
- Concentrate distillation on specific domains of interest.
- Paraphrase inputs and outputs to jam subliminal signals.
- Iteratively filter data using the student model.
Topics
- AI Alignment
- Model Distillation
- AI Safety
- Misaligned Models
- Capability Transfer
- Incrimination Techniques
- AuditBench
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.