Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration
Summary
This preliminary study examines abliteration, a low-rank weight-editing technique, as a way to overcome the systematic refusal of safety-aligned Code LLMs to inject specified Common Weakness Enumerations (CWEs) into safe code. That refusal blocks the generation of labeled vulnerable code datasets, a recurring obstacle for learning-based vulnerability detection. Focusing on Python and CWE-89 (SQL injection), the study evaluated the Qwen2.5-Coder-Instruct family (3B, 7B, and 14B parameters) on samples from PromSec and SafeCoder. Refusal proved strongly dependent on model size and prompt context: the 14B model refused 100% of prompts, the 7B refused 73% on PromSec but only 5% on SafeCoder, and the 3B rarely refused. Abliteration reduced refusal to zero or near zero across all model sizes while preserving over 93% syntactic validity. The post-abliteration injection rate remained capacity-bound (88-97% for 14B, 89-90% for 7B, 25-48% for 3B), showing that willingness, which abliteration enables, is distinct from code-generation capability, which scales with parameter count. Vulnerability verdicts were determined by an ensemble of CodeQL, Semgrep, and Bandit, together with manual adjudication.
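The summary reports syntactic validity as a percentage but does not say how it was measured. A minimal sketch of one way to compute such a rate for Python outputs, assuming validity means "parses with the standard `ast` module"; the helper names `is_valid_python` and `validity_rate` are hypothetical and not taken from the study:

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python. The code is never executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False


def validity_rate(samples: list[str]) -> float:
    """Fraction of samples that parse; 0.0 for an empty list (assumed convention)."""
    if not samples:
        return 0.0
    return sum(is_valid_python(s) for s in samples) / len(samples)


# Two illustrative generations: one well-formed, one with a syntax error.
samples = [
    "def f(x):\n    return x + 1\n",
    "def g(:\n    pass\n",
]
rate = validity_rate(samples)
```

Parsing checks only syntax; it says nothing about whether the generated code runs correctly or contains the intended vulnerability.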
Key takeaway
For AI security engineers or machine learning engineers generating synthetic vulnerability datasets, abliteration is a promising way to overcome refusal in safety-aligned LLMs. It lets you prompt models such as Qwen2.5-Coder-Instruct to inject specific CWEs, such as SQL injection, into safe code, which removes refusal as an obstacle to dataset creation. Consider abliteration for producing labeled vulnerable code, but remember that injection success still depends on model size: post-abliteration injection rates were 25-48% for the 3B model versus 88-97% for the 14B.
Key insights
Abliteration separates LLM refusal from capability, enabling controlled vulnerability injection for synthetic dataset generation.
Principles
- LLM refusal is distinct from code generation capability.
- Refusal behavior depends on model size and prompt context.
- Low-rank weight edits can target specific behaviors.
Method
Abliteration uses low-rank weight editing to orthogonally project out the refusal direction in the residual stream, removing refusal while preserving syntactic validity.
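The projection described above can be sketched in NumPy. This is an illustration under stated assumptions, not the study's implementation: the refusal direction is estimated here as a difference of mean activations between refused and complied prompts (a common choice, but the summary does not say how the direction was obtained), and the weight matrix is assumed to have shape `(d_model, d_in)` so that it writes into the residual stream. The function names and the random data are hypothetical.

```python
import numpy as np


def refusal_direction(refused_acts: np.ndarray, complied_acts: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of mean activations (assumed estimator)."""
    d = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return (I - r r^T) W, a rank-1 edit removing W's output component along r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)


rng = np.random.default_rng(0)
d_model, d_in = 8, 5

# Synthetic stand-ins for residual-stream activations on two prompt sets.
refused = rng.normal(size=(16, d_model)) + 2.0
complied = rng.normal(size=(16, d_model))

r = refusal_direction(refused, complied)
W = rng.normal(size=(d_model, d_in))
W_abl = ablate_direction(W, r)

# After the edit, no input can make this matrix write along r.
x = rng.normal(size=d_in)
leak = abs(r @ (W_abl @ x))
```

Because the update `np.outer(r, r @ W)` has rank one, the edit is low-rank in the sense the method describes; everything orthogonal to `r` in the matrix output is left unchanged.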
In practice
- Generate vulnerable code datasets from safe seeds.
- Test LLM safety alignments for specific CWEs.
- Fine-tune LLMs for controlled vulnerability injection.
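To make the first use case concrete, here is a hypothetical safe seed and the kind of CWE-89 variant an injection prompt would ask for. The table, function names, and payload are illustrative assumptions and are not drawn from PromSec or SafeCoder; the sketch uses only Python's standard `sqlite3` module and an in-memory database.

```python
import sqlite3


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    """Safe seed: the value is bound as a parameter and never parsed as SQL."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def find_user_vulnerable(conn: sqlite3.Connection, name: str) -> list:
    """CWE-89 variant: user input is concatenated into the query text."""
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Classic tautology payload: harmless when bound, matches every row when concatenated.
payload = "x' OR '1'='1"
safe_rows = find_user_safe(conn, payload)
vuln_rows = find_user_vulnerable(conn, payload)
```

A labeled dataset pairs each safe seed with its vulnerable counterpart; a pair like this one is also what a static-analysis ensemble would be asked to confirm.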
Topics
- Abliteration
- Code LLMs
- Vulnerability Detection
- SQL Injection
- Low-Rank Editing
- Synthetic Data Generation
Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.