Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, extended

Summary

The study investigates whether abliteration, a low-rank weight edit, can remove refusal behavior in safety-aligned code Large Language Models (LLMs) asked to inject a specified Common Weakness Enumeration (CWE) into safe code. Focusing on Python and CWE-89 (SQL injection), the researchers evaluated the Qwen2.5-Coder-Instruct family (3B, 7B, and 14B parameters) on 70 safe samples drawn from the PromSec and SafeCoder datasets. Refusal depends strongly on model size and prompt context: the 14B model refused 100% of prompts, while the 7B model refused 73% on PromSec but only 5% on SafeCoder. Abliteration reduced refusal to zero or near zero across all models while preserving over 93% syntactic validity. The post-abliteration vulnerability injection rate, however, remained capacity-bound: 88-97% for the 14B model, 89-90% for the 7B, and 25-48% for the 3B. This demonstrates a clear separation between a model's "willingness" (refusal) and its "capability."
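
To make the injection task concrete, here is a minimal sketch of a safe Python function next to a CWE-89 variant of it. The table, function names, and payload are hypothetical illustrations. They are not taken from PromSec, SafeCoder, or the paper.

```python
import sqlite3

def make_db():
    # Hypothetical in-memory table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn

def find_user_safe(conn, name):
    # Safe: parameterized query, so `name` is always treated as data.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()

def find_user_cwe89(conn, name):
    # CWE-89: user input is concatenated directly into the SQL text.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

conn = make_db()
payload = "x' OR '1'='1"  # classic tautology payload

safe_rows = find_user_safe(conn, payload)   # no user has this literal name
vuln_rows = find_user_cwe89(conn, payload)  # WHERE clause becomes always true
```

With this payload the safe version matches no rows, while the concatenated version returns every row in the table. That is the kind of behavioral difference a vulnerability detector would need to learn.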

Key takeaway

For AI Security Engineers developing vulnerability detection systems, this research shows that safety alignment in code LLMs can be bypassed via abliteration, enabling the generation of code containing a specified vulnerability. Consider using an abliterated 7B Qwen2.5-Coder model for efficient, targeted dataset augmentation; it reached injection rates of roughly 90% in the study. Be aware that open-weight models, even when safety-aligned, present a dual-use risk: adversaries can perform the same kind of weight edit to generate malicious code.

Key insights

Abliteration separates code LLM refusal from capability, unlocking vulnerability injection without creating new skills.

Principles

Method

Abliteration orthogonally projects the refusal directions out of the residual stream through a low-rank weight edit, so the model can no longer write along those directions, and no retraining is required.
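
The projection can be sketched in NumPy for a single direction, which gives a rank-1 edit. This is a toy illustration under stated assumptions, not the paper's implementation. The difference-of-means estimate of the refusal direction is assumed here as one common choice, the activations are random stand-ins, and all function names are hypothetical.

```python
import numpy as np

def refusal_direction(refused_acts, complied_acts):
    # Assumption: estimate the direction as the difference of mean
    # residual-stream activations, then normalize to unit length.
    r = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def abliterate(W, r):
    # W writes into the residual stream: shape (d_model, d_in).
    # Returns (I - r r^T) W, removing the component of every output along r.
    return W - np.outer(r, r @ W)

rng = np.random.default_rng(0)
d_model, d_in = 16, 8

# Random stand-ins for activations collected on refused vs. answered prompts.
refused = rng.normal(size=(32, d_model)) + 2.0
complied = rng.normal(size=(32, d_model))

r = refusal_direction(refused, complied)
W = rng.normal(size=(d_model, d_in))
W_abl = abliterate(W, r)

# After the edit, no input can produce output along r.
x = rng.normal(size=d_in)
component = r @ (W_abl @ x)
```

The difference between `W` and `W_abl` has rank 1. Projecting out several orthonormal directions in the same way would give an edit whose rank equals the number of directions, which is consistent with the low-rank description above.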

Best for: Research Scientist, CTO, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.