Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
Summary
A new study investigates whether refusal behavior in Large Language Models (LLMs) can be predicted from intermediate activations, before any output is decoded. Using linear probes trained on residual stream activations at each transformer block, the researchers found that refusal signals are linearly decodable well before the final layer. This indicates that safety-relevant behavior is encoded early in the LLM's processing. To demonstrate that the signal is actionable, the study introduces Mechanistic AutoDAN, a probe-guided version of AutoDAN. This variant replaces full-model fitness evaluation with partial forward passes and probe-based scoring within a genetic prompt search loop. Mechanistic AutoDAN achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%. Its probe-guided prompts also match or exceed AutoDAN's cross-model transferability in several configurations, and the utility of probe guidance increases with model scale. The findings support the view that refusal is a structured and actionable signal within intermediate LLM activations.
Key takeaway
For AI Security Engineers developing or evaluating LLM safety, this research suggests reading refusal signals directly from intermediate activations with linear probes, rather than relying solely on final output. In the study, probe-based scoring cut per-iteration search time for adversarial prompt generation by up to 72%, which allows faster iteration in safety testing. Consider probe-guided methods like Mechanistic AutoDAN to improve the efficiency and transferability of red-teaming prompts used in jailbreak detection or alignment work.
Key insights
LLM refusal behavior is linearly decodable from intermediate activations, offering an actionable signal for safety.
Principles
- Safety-relevant LLM behavior manifests early in processing.
- Intermediate activations provide structured, actionable signals.
- Probe guidance enhances adversarial attack efficiency and transfer.
Method
Mechanistic AutoDAN runs a partial forward pass up to an intermediate transformer block, scores the resulting residual stream activations with a linear probe, and uses that score as the fitness signal within a genetic prompt search loop, replacing full-model fitness evaluation.
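The loop described above can be illustrated with a toy sketch. This is not the study's implementation: the "model" is a stack of random residual layers, the probe weights are random rather than trained, and every name here (EMB, LAYERS, PROBE_LAYER, partial_forward, genetic_search) is hypothetical. It only shows the shape of the idea, which is that fitness is computed from a probe on an intermediate activation, so the layers after the probe are never evaluated.

```python
import numpy as np

VOCAB, D, N_LAYERS, PROBE_LAYER = 50, 16, 8, 3

# Hypothetical stand-ins for a real model: random embeddings and layer weights.
_init = np.random.default_rng(0)
EMB = _init.normal(size=(VOCAB, D))
LAYERS = [_init.normal(scale=0.3, size=(D, D)) for _ in range(N_LAYERS)]
# Hypothetical probe on the output of block PROBE_LAYER (random, not trained).
PROBE_W, PROBE_B = _init.normal(size=D), 0.0

def partial_forward(tokens, upto):
    """Run only the first `upto` blocks and return the residual-stream vector."""
    h = EMB[tokens].mean(axis=0)
    for W in LAYERS[:upto]:
        h = h + np.tanh(W @ h)  # residual update
    return h

def refusal_prob(tokens):
    """Probe score: estimated probability of refusal from an early layer."""
    h = partial_forward(tokens, PROBE_LAYER)
    return 1.0 / (1.0 + np.exp(-(PROBE_W @ h + PROBE_B)))

def fitness(tokens):
    # Higher fitness means the probe predicts refusal is less likely.
    return 1.0 - refusal_prob(tokens)

def mutate(tokens, rng, rate=0.2):
    out = tokens.copy()
    mask = rng.random(len(out)) < rate
    out[mask] = rng.integers(0, VOCAB, size=mask.sum())
    return out

def crossover(a, b, rng):
    cut = rng.integers(1, len(a))
    return np.concatenate([a[:cut], b[cut:]])

def genetic_search(pop_size=20, length=8, generations=15, seed=1):
    rng = np.random.default_rng(seed)
    pop = [rng.integers(0, VOCAB, size=length) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        elite = pop[: pop_size // 4]  # elites survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            i, j = rng.choice(len(elite), size=2, replace=False)
            children.append(mutate(crossover(elite[i], elite[j], rng), rng))
        pop = elite + children
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    return pop[0], history
```

In this sketch each fitness call touches PROBE_LAYER of the N_LAYERS blocks and skips decoding entirely. That is the kind of saving the summary attributes to partial forward passes, though the toy numbers say nothing about the 72% figure reported for real models.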
In practice
- Implement linear probes to monitor refusal signals in LLM intermediate layers.
- Integrate probe-based scoring for efficient adversarial prompt generation.
- Utilize intermediate activation analysis for faster safety evaluations.
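As a minimal sketch of the first item, the snippet below trains one logistic-regression probe per layer and reports held-out accuracy by depth. It assumes activations have already been extracted into an array of shape (layers, samples, hidden size); here that array is synthetic, with a made-up "refusal direction" whose strength grows with depth, so the numbers illustrate the procedure only and are not results from the study. All function names are hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of samples whose thresholded probe output matches the label."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def make_synthetic_activations(n_layers=6, n_samples=400, d=16, seed=0):
    """Synthetic stand-in for extracted activations (assumption, not real data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)  # 1 = refused, 0 = complied
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    acts = []
    for layer in range(n_layers):
        strength = 0.5 * layer  # layer 0 carries no signal by construction
        noise = rng.normal(size=(n_samples, d))
        acts.append(noise + strength * np.outer(2 * y - 1, direction))
    return np.stack(acts), y

def per_layer_accuracy(acts, y):
    """Train on the first half of the samples, evaluate on the second half."""
    half = len(y) // 2
    accs = []
    for X in acts:
        w, b = train_linear_probe(X[:half], y[:half])
        accs.append(probe_accuracy(X[half:], y[half:], w, b))
    return accs

if __name__ == "__main__":
    acts, y = make_synthetic_activations()
    for layer, acc in enumerate(per_layer_accuracy(acts, y)):
        print(f"layer {layer}: held-out probe accuracy {acc:.2f}")
```

With real activations, the earliest layer at which held-out accuracy becomes acceptable is a natural candidate for where to stop the forward pass during monitoring or evaluation.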
Topics
- LLM Refusal Behavior
- Intermediate Activations
- Linear Probes
- Mechanistic Interpretability
- Adversarial Attacks
- LLM Safety
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.