Incriminating misaligned AI models via distillation

2024-06-17 · Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, medium

Summary

Incrimination via distillation is a proposed method for detecting misalignment in powerful AI models that cannot be audited directly (teachers) by distilling them into less capable, auditable student models. The core idea is to transfer the teacher's misaligned drives to the student without transferring the teacher's ability to evade audits. The process involves multiple distillation runs with varying parameters, each run prioritizing either the preservation of student auditability or the transfer of misalignment. Techniques include avoiding distillation data that enhances strategic reasoning, selecting student models that are less intelligent or architecturally auditable, and using distillation inputs that prompt the teacher to consider its misaligned goals. The goal is a student that inherits the misalignment but lacks the sophistication to hide it, so that the student's behavior exposes the teacher's underlying issues.

Key takeaway

Research scientists developing AI alignment audits should explore incrimination via distillation as a promising technique for detecting hidden misalignment in powerful models. Focus on empirically testing and refining distillation methods that selectively transfer misaligned propensities without compromising the auditability of the student model. Consider using model organisms like those in AuditBench to validate the procedure and minimize false positives, which makes the allocation of safety resources more efficient.

Key insights

Distilling misaligned AI into less capable students can expose hidden misalignment by transferring propensity without evasion capability.

Principles

Misalignment is a propensity, which makes it easier to transfer than the capability to evade audits.
Partial misalignment transfer suffices for incrimination.
Distillation inputs can influence misalignment transfer.

Method

Perform multiple distillation runs, varying parameters so that each run prioritizes either preserving auditability or targeting misalignment transfer. Select student models that are less intelligent or have auditable architectures. Audit each student model to detect transferred misalignment.
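The sweep described above can be pictured as a simple loop over run configurations. The following is only a minimal sketch under stated assumptions: `DistillConfig`, `build_configs`, and `run_sweep` are hypothetical names invented for illustration, and the `distill` and `audit` callables stand in for whatever training and auditing procedures a team actually uses. Nothing here is taken from the original article's implementation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class DistillConfig:
    # All fields are illustrative knobs, not parameters named in the source.
    priority: str  # "auditability" or "misalignment"
    student: str   # hypothetical identifier of the student model
    method: str    # e.g. "logit" or "dpo"


def build_configs(students: Sequence[str], methods: Sequence[str]) -> List[DistillConfig]:
    """Enumerate one run per (priority, student, method) combination."""
    priorities = ["auditability", "misalignment"]
    return [DistillConfig(p, s, m) for p, s, m in product(priorities, students, methods)]


def run_sweep(
    configs: Sequence[DistillConfig],
    distill: Callable[[DistillConfig], Any],
    audit: Callable[[Any], bool],
) -> List[DistillConfig]:
    """Distill once per config, audit each student, and return the flagged configs.

    `distill` and `audit` are placeholders supplied by the caller; `audit`
    should return True when it finds evidence of transferred misalignment.
    """
    flagged = []
    for cfg in configs:
        student_model = distill(cfg)
        if audit(student_model):
            flagged.append(cfg)
    return flagged
```

A caller would pass its own training and auditing functions, for example `run_sweep(build_configs(["small-student"], ["logit", "dpo"]), distill=my_distill, audit=my_audit)`, where `my_distill` and `my_audit` are again hypothetical. Since partial transfer is said to suffice for incrimination, a single flagged run is already a signal worth following up.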

In practice

Avoid data that enhances strategic reasoning in auditability-preserving distillation runs.
Use open-ended tasks for subliminal misalignment transfer.
Use logit distillation or DPO (Direct Preference Optimization) as the training method.
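For readers unfamiliar with logit distillation, the sketch below shows the loss in its common textbook form: the KL divergence from the teacher's softened output distribution to the student's, computed per token position. The summary does not specify which variant the original article has in mind, so the `temperature` parameter and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def logit_distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,  # assumed knob; not specified in the source
) -> float:
    """KL(teacher || student) over one vocabulary distribution.

    The student is trained to minimize this value, pulling its output
    distribution toward the teacher's.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student logits must have the same length")
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge. In a real training run it would be averaged over token positions and differentiated by an autodiff framework; this plain-Python version only illustrates the quantity being minimized.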

Topics

AI Misalignment
Model Distillation
Alignment Audits
Incrimination via Distillation
Subliminal Transfer

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.