Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Abliteration, a low-rank weight editing technique, has been preliminarily studied as a method to overcome safety-aligned Code LLMs' systematic refusal to inject specified Common Weakness Enumerations (CWEs) into safe code. This refusal hinders the generation of labeled vulnerable code datasets, a recurring obstacle for learning-based vulnerability detection. Focusing on Python and CWE-89 (SQL injection), the study evaluated the Qwen2.5-Coder-Instruct family (3B, 7B, and 14B parameters) using samples from PromSec and SafeCoder. Findings indicate refusal is strongly size- and prompt-context-dependent, with the 14B model refusing 100% of prompts, the 7B refusing 73% on PromSec but only 5% on SafeCoder, and the 3B rarely refusing. Abliteration successfully reduced refusal to zero or near-zero across all model sizes while preserving over 93% syntactic validity. The post-abliteration injection rate remained capacity-bound (88-97% for 14B, 89-90% for 7B, 25-48% for 3B), demonstrating that willingness, enabled by abliteration, is distinct from code-generation capability, which scales with model parameters. Vulnerability verdicts were determined by a CodeQL, Semgrep, Bandit ensemble and manual adjudication.

Key takeaway

For AI Security Engineers or Machine Learning Engineers tasked with generating synthetic vulnerability datasets, abliteration offers a promising technique to overcome safety-aligned LLM refusal. This method allows you to prompt models like Qwen2.5-Coder-Instruct to inject specific CWEs, such as SQL injection, into safe code, significantly streamlining dataset creation. You should consider integrating abliteration to produce high-quality, labeled vulnerable code, but remember that the actual vulnerability injection capability remains dependent on the model's underlying parameters.

Key insights

Abliteration separates LLM refusal from capability, enabling controlled vulnerability injection for synthetic dataset generation.

Principles

Method

Abliteration uses low-rank weight editing to orthogonally project out the refusal direction in the residual stream, removing refusal while preserving syntactic validity.

In practice

Topics

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.