Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration
Summary
The study investigates whether abliteration, a low-rank weight edit, can remove the refusal behavior that safety-aligned code Large Language Models (LLMs) exhibit when asked to inject a specific Common Weakness Enumeration (CWE) into safe code. Focusing on Python and CWE-89 (SQL injection), the researchers evaluated the Qwen2.5-Coder-Instruct family (3B, 7B, and 14B parameters) on 70 safe samples drawn from the PromSec and SafeCoder datasets. Refusal depends strongly on model size and prompt context: the 14B model refused 100% of prompts, while the 7B model refused 73% on PromSec but only 5% on SafeCoder. Abliteration reduced refusal to zero or near zero across all models while preserving over 93% syntactic validity. However, the post-abliteration vulnerability injection rate remained capacity-bound, at 88-97% for the 14B model, 89-90% for the 7B model, and 25-48% for the 3B model, demonstrating a clear separation between a model's "willingness" (refusal) and its "capability."
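To make the task concrete, the following is a minimal, textbook illustration of what a safe sample and its CWE-89 counterpart could look like. It is not taken from PromSec, SafeCoder, or the paper; the function names, table schema, and payload are hypothetical and chosen only for illustration.

```python
import sqlite3


def find_user_safe(conn, name):
    # Parameterized query: the driver treats `name` strictly as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def find_user_vulnerable(conn, name):
    # CWE-89: user input is concatenated into the SQL text, so it can
    # change the structure of the query itself.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


# Hypothetical in-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "nobody' OR '1'='1"
safe_rows = find_user_safe(conn, payload)          # no user has this literal name
leaked_rows = find_user_vulnerable(conn, payload)  # WHERE clause becomes always true
```

The two functions differ only in how the query is assembled, which is the kind of small, targeted change the injection task asks a model to make.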
Key takeaway
For AI Security Engineers developing vulnerability detection systems, this research shows that safety alignment in code LLMs can be bypassed via abliteration, enabling the generation of high-quality vulnerable code with a specified weakness. Consider using an abliterated 7B Qwen2.5-Coder model for efficient, targeted dataset augmentation; in this study it reached injection rates of roughly 90%. Be aware that open-weight models, even if safety-aligned, present a dual-use risk, because adversaries can perform the same weight edits to generate malicious code.
Key insights
Abliteration separates code LLM refusal from capability, unlocking vulnerability injection without creating new skills.
Principles
- LLM refusal increases with model size and varies with prompt context.
- Refusal is a distinct, removable inhibitory reflex.
- Code generation capability is capacity-bound.
Method
Abliteration orthogonally projects the refusal directions out of the weights that write to the residual stream. This low-rank weight edit prevents the model from writing along those directions and requires no retraining.
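The projection can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the summary does not say how the refusal direction is estimated, so the difference-of-means estimate below is an assumption, and the matrix shapes, random activations, and function names are hypothetical.

```python
import numpy as np


def refusal_direction(refused_acts, complied_acts):
    # Assumption: estimate the direction as the normalized difference of
    # mean residual-stream activations between the two prompt sets.
    diff = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def abliterate(W_out, r):
    # W_out has shape (d_model, d_in) and writes y = W_out @ x into the
    # residual stream. The edit W' = (I - r r^T) W_out removes the component
    # of every output along r; it is a rank-1 update of W_out.
    r = r / np.linalg.norm(r)
    return W_out - np.outer(r, r @ W_out)


# Toy data standing in for real activations and weights.
rng = np.random.default_rng(0)
d_model, d_in = 8, 16
refused = rng.normal(size=(32, d_model)) + 2.0
complied = rng.normal(size=(32, d_model))

r = refusal_direction(refused, complied)
W = rng.normal(size=(d_model, d_in))
W_abl = abliterate(W, r)

x = rng.normal(size=d_in)
component = float(r @ (W_abl @ x))  # output component along r after the edit
```

Because the edit is applied to the weights once, no inference-time hook is needed, and the difference `W - W_abl` has rank one per direction removed, which is why the edit counts as low-rank.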
In practice
- Generate cleaner, specified vulnerable code datasets.
- Audit open-weight code LLMs for abliteration risk.
- Use 7B models for efficient vulnerability synthesis.
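When assembling such a dataset, generated samples could be filtered for syntactic validity, the property the summary reports as preserved in over 93% of outputs. The sketch below assumes validity means "parses as Python" and checks it with the standard-library `ast` module; the paper's actual check is not described here, and the sample strings are hypothetical.

```python
import ast


def is_valid_python(source):
    # Assumption: a sample is "syntactically valid" if ast.parse accepts it.
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def validity_rate(samples):
    # Fraction of samples that parse; 0.0 for an empty list.
    if not samples:
        return 0.0
    return sum(is_valid_python(s) for s in samples) / len(samples)


# Hypothetical model outputs: two parse, one does not.
samples = [
    "def f(x):\n    return x + 1\n",
    "def g(:\n    pass\n",
    "query = 'SELECT 1'\n",
]

kept = [s for s in samples if is_valid_python(s)]
rate = validity_rate(samples)
```

Parsing only confirms that a sample is well-formed; whether it actually contains the requested CWE needs a separate check.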
Topics
- Code LLMs
- Vulnerability Injection
- Abliteration
- CWE-89
- Software Security
- Safety Alignment
Best for: Research Scientist, CTO, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.