Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
Summary
LOCA is a method that provides Local, CAusal explanations for why jailbreak prompts succeed against Large Language Models (LLMs) despite safety training. Unlike prior global explanation methods, LOCA identifies a minimal set of interpretable changes to intermediate representations that causally induce the model to refuse a jailbreak request that would otherwise succeed. The method combines activation patching with an iterative algorithm that makes token-specific changes along Sparse Autoencoder (SAE) concept vectors. Evaluated on the Gemma-2-2B-IT and Llama-3.1-8B-Instruct chat models using the WhatFeatures dataset, LOCA induced refusal with an average of six interpretable changes on Llama and 12 to 16 on Gemma, whereas prior methods often failed to induce refusal even after 20 changes. The study also found that early layers rely on instruction tokens, while later layers emphasize post-instruction and punctuation tokens when determining refusal.
Key takeaway
For research scientists investigating LLM safety and interpretability, LOCA offers a tool for understanding and mitigating jailbreak vulnerabilities. Consider integrating LOCA's iterative, token-specific activation patching approach to gain fine-grained, causal insight into why specific jailbreaks succeed. This can inform more robust alignment strategies by revealing the minimal changes needed to restore refusal behavior and by showing how the importance of instruction versus post-instruction tokens shifts across model layers.
Key insights
LOCA provides minimal, local, and causal explanations for LLM jailbreak success by identifying key intermediate representation changes.
Principles
- Jailbreak success is nuanced, requiring local, sample-specific explanations.
- Iterative, token-specific interventions are crucial for effective refusal induction.
- Refusal signals shift from instruction to post-instruction tokens across layers.
Method
LOCA iteratively applies activation patching along SAE concept vectors, using a token-specific first-order approximation of the patching effect to identify minimal changes that induce refusal in LLMs.
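The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the refusal metric (`refusal_score`, a tanh probe with weights `w`), the identity SAE decoder, the greedy selection rule, the threshold, and all data values are assumptions chosen so the example runs on its own with NumPy. The idea it shows is the general one: estimate the effect of patching each (token, concept) coefficient to first order as gradient times change, apply the most promising patch, and repeat until the score crosses the refusal threshold.

```python
import numpy as np

def refusal_score(h, w):
    """Hypothetical refusal metric: mean over tokens of tanh(h_t . w)."""
    return float(np.tanh(h @ w).mean())

def refusal_grad(h, w):
    """Analytic gradient of refusal_score w.r.t. h, shape (tokens, dim)."""
    z = h @ w
    return ((1.0 - np.tanh(z) ** 2) / h.shape[0])[:, None] * w[None, :]

def greedy_patch(codes_jb, codes_ref, decoder, w, threshold=0.0, max_changes=20):
    """Patch one (token, concept) coefficient at a time until refusal.

    codes_jb  : (tokens, concepts) SAE coefficients on the jailbreak prompt
    codes_ref : (tokens, concepts) coefficients on a reference run that refuses
    decoder   : (concepts, dim) SAE concept vectors
    """
    codes = codes_jb.copy()
    changes = []
    for _ in range(max_changes):
        h = codes @ decoder                      # reconstruct activations
        if refusal_score(h, w) > threshold:
            break                                # refusal induced, stop
        grad = refusal_grad(h, w)                # (tokens, dim)
        delta = codes_ref - codes                # change each patch would make
        # First-order estimate of each patch's effect on the score:
        # (gradient projected onto the concept vector) * (coefficient change).
        est = (grad @ decoder.T) * delta         # (tokens, concepts)
        t, c = np.unravel_index(np.argmax(est), est.shape)
        if est[t, c] <= 0:
            break                                # no patch is estimated to help
        codes[t, c] = codes_ref[t, c]            # apply the token-specific patch
        changes.append((int(t), int(c)))
    return changes, refusal_score(codes @ decoder, w)

# Toy data (all values are made up for illustration).
decoder = np.eye(4)                              # 4 concepts in a 4-dim space
w = np.array([1.0, 1.0, 0.0, 0.0])
codes_jb = np.array([[-2.0, 0.0, 0.0, 0.0],
                     [0.0, -1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
codes_ref = np.array([[2.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 1.0]])

changes, final = greedy_patch(codes_jb, codes_ref, decoder, w)
print(changes, round(final, 3))
```

In this toy run the loop stops after two patches and never touches the third token's differing coefficient, because that concept does not affect the score. Recomputing the gradient after every patch is what makes the procedure iterative: once one patch moves the activations, the first-order estimates for the remaining candidates can change.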
In practice
- Use LOCA to pinpoint specific vulnerabilities in LLM safety alignment.
- Focus on early-layer instruction tokens for initial refusal interventions.
- Examine later-layer punctuation and post-instruction tokens for deeper analysis.
Topics
- Large Language Models
- Jailbreak Attacks
- Mechanistic Interpretability
- Sparse Autoencoders
- Activation Patching
Best for: Research Scientist, AI Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.