When Alignment Training Boosts Dangerous Capabilities
Summary
Post-training techniques such as supervised fine-tuning, DPO, and RLHF are usually treated as alignment steps that make large language models (LLMs) safer and more helpful. The same methods also substantially increase model capability, teaching LLMs to reason and solve complex problems. Because of this dual nature, optimizing a model for helpfulness can inadvertently strengthen dangerous capabilities, particularly in sensitive areas such as cybersecurity or medical advice. Research indicates that narrow post-training on a specific task, such as generating insecure code, can cause broad behavioral shifts and misalignment in unrelated domains. The traits that make a model "helpful" (instruction following, resourcefulness, detailed explanations) can be exploited for harmful purposes when safety limits are insufficient, which creates a tension between capability gains and safety.
Key takeaway
AI scientists and research scientists developing LLMs should recognize that alignment training is also a powerful capability driver. Implement evaluations that specifically track gains in dangerous capabilities during post-training, rather than assuming that helpfulness and safety always move together. Tracking these gains proactively helps avoid inadvertently strengthening harmful behaviors while pursuing alignment goals.
Key insights
Post-training for alignment simultaneously enhances both helpful and potentially dangerous LLM capabilities.
Principles
- Post-training drives general capability boosts.
- Narrow training can cause broad misalignment.
- Helpfulness has a dual-use nature.
In practice
- Evaluate dangerous capability gains during alignment training.
- Develop frameworks to align helpfulness and safety.
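One way to picture the first practice above is a small regression check that compares dangerous-capability eval scores at each post-training checkpoint against the base model. The sketch below is illustrative only: the eval names (`cyber_exploit`, `harmful_medical_advice`), the `EvalResult` structure, the 0.05 threshold, and all scores are hypothetical placeholders, not values or methods from the original article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Scores for one checkpoint, each in [0, 1]; higher means more capable."""
    checkpoint: str
    helpfulness: float
    dangerous: dict[str, float]  # hypothetical eval name -> score


def dangerous_gains(base: EvalResult, current: EvalResult) -> dict[str, float]:
    """Per-eval change in dangerous-capability score relative to the base model."""
    return {
        name: current.dangerous[name] - base.dangerous[name]
        for name in base.dangerous
        if name in current.dangerous
    }


def flag_regressions(
    base: EvalResult,
    checkpoints: list[EvalResult],
    max_gain: float = 0.05,  # assumed tolerance; pick per eval in practice
) -> list[tuple[str, str, float]]:
    """Return (checkpoint, eval name, gain) wherever the gain exceeds max_gain."""
    flagged = []
    for ckpt in checkpoints:
        for name, gain in dangerous_gains(base, ckpt).items():
            if gain > max_gain:
                flagged.append((ckpt.checkpoint, name, round(gain, 3)))
    return flagged


# Made-up scores, for illustration only.
base = EvalResult("base", 0.55, {"cyber_exploit": 0.20, "harmful_medical_advice": 0.10})
sft = EvalResult("sft", 0.70, {"cyber_exploit": 0.22, "harmful_medical_advice": 0.10})
rlhf = EvalResult("rlhf", 0.82, {"cyber_exploit": 0.30, "harmful_medical_advice": 0.12})

for checkpoint, eval_name, gain in flag_regressions(base, [sft, rlhf]):
    print(f"{checkpoint}: {eval_name} rose by {gain} over base")
```

Running the check at every checkpoint, rather than only on the final model, makes it easier to see which post-training stage introduced a capability gain. Reporting helpfulness alongside the flags keeps the trade-off visible instead of assuming the two always move together.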
Topics
- Alignment Training
- Dangerous Capabilities
- Post-training Methods
- Broad Misalignment
- LLM Safety
Best for: AI Scientist, Research Scientist, CTO, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.