Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, medium

Summary

A new study investigates whether refusal behavior in Large Language Models (LLMs) can be predicted from intermediate activations, before any output is decoded. Using linear probes trained on residual stream activations at each transformer block, the researchers found that refusal signals are linearly decodable well before the final layer. This suggests that the refuse-or-comply decision is already represented in intermediate layers rather than emerging only at the output. To show that the signal is actionable, the study introduces Mechanistic AutoDAN, a probe-guided version of AutoDAN. Within the genetic prompt search loop, this variant replaces full-model fitness evaluation with partial forward passes and probe-based scoring. Mechanistic AutoDAN achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%. Its probe-guided prompts also match or exceed AutoDAN's cross-model transferability in several configurations, and the utility of probe guidance increases with model scale. The findings support the view that refusal is a structured, actionable signal within intermediate LLM activations.

Key takeaway

For AI Security Engineers developing or evaluating LLM safety, this research points to a practical change in method: refusal can be read from intermediate activations with linear probes instead of being inferred only from the final output. In the study, scoring candidates with a probe on a partial forward pass cut per-iteration search time for adversarial prompt generation by up to 72%, which allows faster iteration during safety testing. Consider probe-guided methods like Mechanistic AutoDAN when red-teaming efficiency and cross-model transferability matter, and consider the same probes as a lightweight signal for jailbreak detection.

Key insights

LLM refusal behavior is linearly decodable from intermediate activations, offering an actionable signal for safety.
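
To make "linearly decodable" concrete, here is a minimal sketch of a per-layer linear probe, written in plain Python. Everything in it is an illustrative assumption rather than the study's setup: the activations are synthetic Gaussian vectors, not real residual stream activations, and the helper names (`train_probe`, `synthetic_layer`) and the per-layer signal strengths are hypothetical. The only point is that a logistic-regression probe recovers a label more reliably when the label is more strongly encoded along a linear direction.

```python
import math
import random


def train_probe(xs, ys, epochs=100, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples where the sign of the probe logit matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += (logit > 0) == (y == 1)
    return hits / len(xs)


def synthetic_layer(rng, n, dim, signal):
    """Hypothetical stand-in for one layer's activations (label 1 = refuse)."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal  # label encoded along one direction
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
accuracies = []
# Assumed signal strength per "layer"; real values would come from a real model.
for signal in [0.0, 0.5, 1.5, 3.0]:
    train_x, train_y = synthetic_layer(rng, 200, 8, signal)
    test_x, test_y = synthetic_layer(rng, 200, 8, signal)
    w, b = train_probe(train_x, train_y)
    accuracies.append(probe_accuracy(w, b, test_x, test_y))

for layer, acc in enumerate(accuracies):
    print(f"layer {layer}: held-out probe accuracy {acc:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations captured at each transformer block for prompts labeled as refused or answered. The layer at which held-out accuracy first becomes high indicates how early the signal can be read.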

Method

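Mechanistic AutoDAN runs a genetic prompt search loop in which each candidate prompt is evaluated with a partial forward pass: the model is run only as far as an intermediate transformer block, and a linear probe on the residual stream activations at that block supplies the fitness score. This probe-based scoring replaces full-model fitness evaluation.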
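
The control flow described above can be sketched on a toy model. This is a schematic under loud assumptions, not the paper's implementation: the "model" is a stack of random residual blocks, the probe weights are random rather than trained, prompts are lists of integer token ids, and all names (`partial_forward`, `refusal_score`, `search`, `PROBE_LAYER`) are hypothetical. It shows only how a probe score from an intermediate layer can stand in for a full forward pass as the fitness function of a genetic search.

```python
import math
import random

NUM_LAYERS = 6   # toy depth; a full evaluation would run all blocks
PROBE_LAYER = 2  # the probe reads the residual stream after this many blocks
DIM, VOCAB = 4, 20

_init = random.Random(0)
# Random stand-ins for an embedding table, block weights, and a trained probe.
EMBED = [[_init.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
BLOCKS = [[[_init.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]
          for _ in range(NUM_LAYERS)]
PROBE_W = [_init.gauss(0, 1) for _ in range(DIM)]


def partial_forward(tokens, upto):
    """Run only the first `upto` blocks and return the residual vector."""
    h = [sum(EMBED[t][i] for t in tokens) / len(tokens) for i in range(DIM)]
    for layer in range(upto):
        update = [math.tanh(sum(BLOCKS[layer][i][j] * h[j] for j in range(DIM)))
                  for i in range(DIM)]
        h = [hi + ui for hi, ui in zip(h, update)]  # residual connection
    return h


def refusal_score(tokens):
    """Probe logit at PROBE_LAYER; lower means refusal is predicted less."""
    h = partial_forward(tokens, PROBE_LAYER)
    return sum(w * x for w, x in zip(PROBE_W, h))


def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))  # single-point crossover
    return a[:cut] + b[cut:]


def mutate(tokens, rng, rate=0.3):
    child = list(tokens)
    if rng.random() < rate:
        child[rng.randrange(len(child))] = rng.randrange(VOCAB)
    return child


def search(generations=30, pop_size=20, length=8, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randrange(VOCAB) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=refusal_score)      # fitness = probe score, not full model
        history.append(refusal_score(pop[0]))
        elite = pop[: pop_size // 4]     # elitism keeps the best candidates
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = elite + children
    pop.sort(key=refusal_score)
    history.append(refusal_score(pop[0]))
    return pop[0], history


best, history = search()
print(f"probe score: {history[0]:.3f} -> {history[-1]:.3f}")
```

In this toy, each fitness call runs `PROBE_LAYER` of `NUM_LAYERS` blocks and skips decoding entirely, which is where a saving in per-iteration time would come from. The actual speedup on a real model depends on the chosen layer and on what the full evaluation involves.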

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.