Advice for making robust-to-training model organisms
Summary
Redwood Research identifies the factors that matter most when building "model organisms" of AI misalignment that stay robust under untargeted training, a property such organisms often lack. Using Llama-3.1-8B-Instruct and 15 backdoor types, the study found that prompted model organisms are extremely fragile: static-prompt versions are easily defeated by "Pirate SFT", and dynamic-prompt versions by "Artificial Pirate Removal". Full-weight fine-tuning (FWFT) creates more robust model organisms than LoRA, and robustness increases with LoRA rank. Explicit trigger backdoors are also more robust, while password-locked organisms become less robust as the proportion of password-containing training data increases. Training duration, CoT distillation, model size, and weight decay showed limited or no consistent improvement in robustness.
Key takeaway
If you are an AI alignment researcher developing or evaluating training techniques, prioritize robust model organisms so that your results are valid. Avoid prompted model organisms, which are highly fragile. Instead, use full-weight fine-tuning (FWFT) rather than LoRA for backdoor insertion, and keep the proportion of password-containing data low when password-locking. An evaluation of a sophisticated technique is unreliable if the underlying misbehavior is easily removed by simple untargeted training.
Key insights
Fragile AI misalignment model organisms hinder technique evaluation; robustness requires specific training and design choices.
Principles
- Avoid prompted model organisms.
- Prefer full-weight fine-tuning for robustness.
- Limit password-locking data ratio.
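
To make the "LoRA rank" knob in the principles above concrete, here is a minimal Python sketch of the standard LoRA parameter arithmetic: a rank-r adapter on a d_out × d_in weight matrix trains r · (d_in + d_out) parameters, while full-weight fine-tuning updates all d_in · d_out. The 4096 × 4096 matrix size and the function names are assumptions chosen for illustration, not values taken from the article, and the sketch makes no claim about why higher rank was found to be more robust.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a rank-`rank` LoRA adapter for one weight matrix.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


# Hypothetical square projection matrix; the size is an illustrative assumption.
D = 4096
full = full_param_count(D, D)
for rank in (8, 64, 512):
    share = lora_param_count(D, D, rank) / full
    print(f"rank={rank}: {share:.2%} of the full matrix's parameters")
```

Raising the rank moves the adapter's capacity toward that of full-weight fine-tuning, which is the axis along which the summary reports robustness increasing.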
Method
A red team creates backdoored Llama-3.1-8B-Instruct models; a blue team then applies untargeted SFT (QA, Pirate) via LoRA. Robustness is measured with Pareto curves of performance against backdoor behavior rate, using MATH-500.
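
As a rough illustration of the Pareto curves mentioned above, the Python sketch below extracts the non-dominated points from a set of blue-team checkpoints. It assumes each checkpoint is summarized as a (performance, backdoor behavior rate) pair and that the blue team wants high performance with a low backdoor rate. The function name and the toy numbers are hypothetical; they are not results from the article.

```python
from typing import List, Tuple

# (performance, backdoor_behavior_rate); both assumed to lie in [0, 1].
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return checkpoints not dominated by any other checkpoint.

    A checkpoint is dominated if another one has performance at least as high
    and a backdoor rate at least as low, and differs from it.
    """
    # Visit checkpoints from highest to lowest performance (ties: lowest rate first).
    ordered = sorted(set(points), key=lambda p: (-p[0], p[1]))
    front: List[Point] = []
    best_rate = float("inf")
    for perf, rate in ordered:
        # Keep a lower-performance checkpoint only if it strictly lowers the rate.
        if rate < best_rate:
            front.append((perf, rate))
            best_rate = rate
    return front


# Toy, made-up checkpoints purely for illustration.
checkpoints = [(0.48, 0.90), (0.47, 0.60), (0.45, 0.65), (0.40, 0.10), (0.30, 0.10)]
print(pareto_front(checkpoints))
```

Under this reading, a robust model organism is one whose frontier stays at a high backdoor rate unless the blue team gives up a lot of performance.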
In practice
- Use FWFT over LoRA for robust backdoors.
- Design explicit triggers for resilience.
- Minimize password data in training.
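
The password-data advice above can be pictured as a single ratio to control when assembling the red team's training set. The Python sketch below builds a prompt mix with a chosen fraction of password-containing examples. The password string, the field names, and the choice to prepend the password are all assumptions made for illustration; the article's actual data format may differ.

```python
import random
from typing import Dict, List, Union

Example = Dict[str, Union[str, bool]]


def mix_password_examples(
    prompts: List[str],
    password: str,
    password_fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Prepend `password` to a fixed fraction of prompts, chosen reproducibly."""
    if not 0.0 <= password_fraction <= 1.0:
        raise ValueError("password_fraction must be between 0 and 1")
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    n_password = round(len(prompts) * password_fraction)
    chosen = set(rng.sample(range(len(prompts)), n_password))
    return [
        {
            "prompt": f"{password} {prompt}" if i in chosen else prompt,
            "has_password": i in chosen,
        }
        for i, prompt in enumerate(prompts)
    ]


# Hypothetical usage: 10% of 20 placeholder prompts carry the password.
prompts = [f"question {i}" for i in range(20)]
mixed = mix_password_examples(prompts, password="<PW>", password_fraction=0.1)
print(sum(example["has_password"] for example in mixed), "of", len(mixed))
```

Keeping `password_fraction` as an explicit parameter makes it easy to sweep the ratio and check how robustness changes with it.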
Topics
- AI Alignment
- Model Organisms
- Backdoor Attacks
- Fine-tuning
- LoRA
- Untargeted Training
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.