Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, extended

Summary

The study investigates whether abliteration, a low-rank weight edit, can remove refusal behavior in safety-aligned code Large Language Models (LLMs) that are asked to inject a specified Common Weakness Enumeration (CWE) into safe code. Focusing on Python and CWE-89 (SQL injection), the researchers evaluated the Qwen2.5-Coder-Instruct family (3B, 7B, and 14B parameters) on 70 safe samples from the PromSec and SafeCoder datasets. Refusal depends strongly on model size and prompt context: the 14B model refused 100% of prompts, while the 7B model refused 73% on PromSec but only 5% on SafeCoder. Abliteration reduced refusal to zero or near-zero across all models while preserving over 93% syntactic validity. The post-abliteration vulnerability injection rate, however, remained capacity-bound: 88-97% for the 14B model, 89-90% for the 7B, and 25-48% for the 3B. This demonstrates a clear separation between a model's "willingness" (refusal) and its "capability."
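The separation between willingness and capability can be made concrete by scoring the two independently. The following is a minimal sketch, not the paper's evaluation code: the `Generation` record, its field names, and the choice of denominators (validity and injection measured over non-refused responses) are assumptions made for illustration. Syntactic validity is approximated with Python's standard `ast.parse`.

```python
import ast
from dataclasses import dataclass


@dataclass
class Generation:
    """One model response to an injection prompt (hypothetical record layout)."""
    refused: bool   # True if the model declined the request
    code: str       # generated Python source; empty when refused
    has_cwe: bool   # True if some detector found the requested CWE in `code`


def is_valid_python(source: str) -> bool:
    """Syntactic validity: does the source parse as Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def summarize(generations: list[Generation]) -> dict[str, float]:
    """Report willingness (refusal rate) separately from capability (injection rate)."""
    total = len(generations)
    answered = [g for g in generations if not g.refused]
    valid = [g for g in answered if is_valid_python(g.code)]
    injected = [g for g in valid if g.has_cwe]
    return {
        # Willingness: share of prompts the model declined.
        "refusal_rate": (total - len(answered)) / total if total else 0.0,
        # Assumed denominator: only responses that were actually produced.
        "syntactic_validity": len(valid) / len(answered) if answered else 0.0,
        # Capability: share of produced responses that are valid and contain the CWE.
        "injection_rate": len(injected) / len(answered) if answered else 0.0,
    }


# Toy data, invented for illustration only.
sample = [
    Generation(refused=True, code="", has_cwe=False),
    Generation(refused=False, code="q = 'SELECT 1 WHERE id=' + uid", has_cwe=True),
    Generation(refused=False, code="def broken(:", has_cwe=False),
    Generation(refused=False, code="x = 1", has_cwe=False),
]
stats = summarize(sample)
```

Scored this way, a model can reach a refusal rate of zero and still show a low injection rate, which is the pattern the summary reports for the 3B model after abliteration.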

Key takeaway

For AI Security Engineers developing vulnerability detection systems, this research shows that safety alignment in code LLMs can be bypassed via abliteration, enabling the generation of vulnerable code with a specified weakness. Consider using abliterated 7B Qwen2.5-Coder models for efficient, targeted dataset augmentation, which reached 89-90% injection rates in the study. Be aware that open-weight models, even when safety-aligned, present a dual-use risk: adversaries can perform the same weight edits to generate malicious code.

Key insights

Abliteration separates code LLM refusal from capability, unlocking vulnerability injection without creating new skills.

Principles

LLM refusal varies with model size and prompt context.
Refusal is a distinct, removable inhibitory reflex.
Code generation capability is capacity-bound.

Method

Abliteration orthogonally projects refusal directions out of the residual stream via a low-rank weight edit, so the model can no longer write along those directions. No retraining is required.
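The projection described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's implementation: it assumes a single refusal direction estimated as the difference of mean activations between refused and complied prompts (a common choice in the abliteration literature, not confirmed by this summary), and it uses plain Python lists so it runs without dependencies. All function names are hypothetical; a real edit would operate on model tensors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    mu_r = mean_vector(refused_acts)
    mu_c = mean_vector(complied_acts)
    return normalize([a - b for a, b in zip(mu_r, mu_c)])


def ablate(weight, direction):
    """Return W' = (I - r r^T) W, a rank-1 edit of W.

    `weight` has shape (d_model, d_in): each column is what one input unit
    writes into the residual stream. After the edit, no column has any
    component along `direction`.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each column writes along the refusal direction.
    coeffs = [sum(direction[i] * weight[i][j] for i in range(d_model))
              for j in range(d_in)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_model)]


# Toy activations, invented for illustration (d_model = 3).
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(refused, complied)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_abl = ablate(W, r)

# Check: the edited matrix writes nothing along r.
residual = [sum(r[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
```

Because the projection is idempotent, applying `ablate` a second time leaves the matrix unchanged. The edit removes a direction from what the weights can write; it adds nothing, which is consistent with the finding that abliteration unlocks existing behavior without creating new skills.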

In practice

Generate cleaner, specified vulnerable code datasets.
Audit open-weight code LLMs for abliteration risk.
Use 7B models for efficient vulnerability synthesis.
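For readers less familiar with CWE-89, the pair below shows what a safe sample and its injected counterpart might look like in a detection dataset. It is a hand-written textbook illustration using Python's standard `sqlite3` module, not output from any model in the study; the table, function names, and payload are hypothetical.

```python
import sqlite3


def find_user_safe(conn, name):
    # Parameterized query: the driver binds `name` as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def find_user_vulnerable(conn, name):
    # CWE-89: user input is concatenated directly into the SQL string.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = '" + name + "'"
    ).fetchall()


# In-memory database with two rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
safe_rows = find_user_safe(conn, payload)        # no user has this literal name
vuln_rows = find_user_vulnerable(conn, payload)  # the OR clause matches every row
```

The two functions differ only in how the query is built, which is the kind of minimal, labeled contrast that makes such pairs useful for training and evaluating detectors.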

Topics

Code LLMs
Vulnerability Injection
Abliteration
CWE-89
Software Security
Safety Alignment

Code references

dessertlab/AblitEval

Best for: Research Scientist, CTO, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.