Advice for making robust-to-training model organisms

2024-06-17 · Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, long

Summary

This research investigates factors influencing the robustness of "model organisms" used to study misaligned AI systems against untargeted training. The study, using Llama-3.1-8B-Instruct and Qwen3-32B, found that prompted model organisms are extremely fragile, with static-prompt versions easily defeated by Pirate SFT and dynamic-prompt versions vulnerable to Artificial Pirate Removal (APR). Full-weight fine-tuning (FWFT) model organisms proved more robust than LoRA-trained ones, especially as LoRA rank increased. Password-locked model organisms were less robust when a larger fraction of training data contained the password. The research also noted that specific backdoor behaviors and triggers affect robustness, with "explicit trigger backdoors" being more resilient. Factors like training longer, CoT-distilling, larger model sizes (e.g., Qwen-2.5 lineup from 1.5B to 72B), SOAP optimizer, or weight decay did not consistently increase robustness.

Key takeaway

For AI scientists developing techniques to control misaligned AI systems, you should prioritize creating robust model organisms to ensure your evaluation results are meaningful. Avoid using fragile prompted model organisms. Instead, opt for full-weight fine-tuning (FWFT) and consider the specific backdoor behavior and trigger type, as these significantly impact resilience. You must test untargeted training as a baseline; otherwise, sophisticated techniques might appear effective simply because the model organism was weak.

Key insights

Model organisms for AI misalignment are often fragile to untargeted training; robustness depends on creation method and behavior type.

Principles

Avoid prompted model organisms.
FWFT yields more robust models.
Higher LoRA rank increases robustness.

Method

Red teams create backdoored Llama-3.1-8B-Instruct models with triggers; blue teams apply untargeted SFT (e.g., Pirate SFT, QA SFT, APR) to remove misbehavior while preserving MATH-500 capabilities.

In practice

Use full-weight fine-tuning for robustness.
Minimize password-locking data.
Test untargeted training baselines.

Topics

AI Misalignment
Model Organisms
Untargeted Training
Full-Weight Fine-Tuning
LoRA
Backdoor Triggers
Model Robustness

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.