Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

LOCA is a method that provides Local, CAusal explanations for why jailbreak prompts succeed against Large Language Models (LLMs) despite safety training. Unlike prior global explanation methods, LOCA identifies a minimal set of interpretable changes to intermediate representations that causally induce the model to refuse a jailbreak request it would otherwise answer. The method combines activation patching with an iterative algorithm that makes token-specific changes along Sparse Autoencoder (SAE) concept vectors. Evaluated on the Gemma-2-2B-IT and Llama-3.1-8B-Instruct chat models using the WhatFeatures dataset, LOCA induced refusal with an average of six interpretable changes on Llama and 12-16 on Gemma, whereas prior methods often failed to induce refusal even after 20 changes. The study also found that early layers rely on instruction tokens to determine refusal, while later layers place more weight on post-instruction and punctuation tokens.

Key takeaway

For research scientists investigating LLM safety and interpretability, LOCA offers a way to explain individual jailbreaks rather than jailbreak behavior in aggregate. Consider applying its iterative, token-specific activation patching approach to obtain fine-grained, causal accounts of why a specific jailbreak succeeds. Knowing the minimal changes needed to restore refusal behavior, and how the importance of instruction versus post-instruction tokens shifts across model layers, can inform more robust alignment strategies.

Key insights

LOCA provides minimal, local, and causal explanations for LLM jailbreak success by identifying key intermediate representation changes.

Principles

Method

LOCA iteratively applies activation patching along SAE concept vectors, using a token-specific first-order approximation of the patching effect to identify minimal changes that induce refusal in LLMs.
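The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the refusal score, the orthonormal "concept" directions, the reference activations taken from a refused run, and every name below (refusal_score, greedy_patch, and so on) are assumptions made for illustration. The sketch only shows the general shape of the idea: estimate the effect of each (token, concept) patch with a first-order (gradient) approximation, apply the most promising one, and repeat until the score crosses a refusal threshold.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_score(acts, readout, token_weights):
    """Toy stand-in for a model's refusal signal (assumption, not a real LLM)."""
    return sum(w * math.tanh(dot(x, readout)) for x, w in zip(acts, token_weights))

def score_gradient(acts, readout, token_weights):
    """Analytic gradient of refusal_score with respect to each token's activation."""
    grads = []
    for x, w in zip(acts, token_weights):
        slope = w * (1.0 - math.tanh(dot(x, readout)) ** 2)
        grads.append([slope * r for r in readout])
    return grads

def greedy_patch(jailbreak_acts, refused_acts, concepts, readout, token_weights,
                 threshold=0.0, max_changes=20):
    """Greedily patch (token, concept) coefficients until the toy score exceeds threshold.

    Assumes the concept directions are orthonormal, so moving along one
    direction by `gap` exactly matches the refused run's coefficient.
    """
    acts = [list(x) for x in jailbreak_acts]
    changes = []
    while (refusal_score(acts, readout, token_weights) <= threshold
           and len(changes) < max_changes):
        grads = score_gradient(acts, readout, token_weights)
        best = None
        for t, (x, x_ref, g) in enumerate(zip(acts, refused_acts, grads)):
            for c, direction in enumerate(concepts):
                if (t, c) in changes:
                    continue
                # Coefficient gap along this concept between refused and current run.
                gap = dot(x_ref, direction) - dot(x, direction)
                # First-order estimate of how much this patch would raise the score.
                estimate = gap * dot(g, direction)
                if best is None or estimate > best[0]:
                    best = (estimate, t, c, gap)
        if best is None or best[0] <= 0.0:
            break  # no remaining patch is predicted to help
        _, t, c, gap = best
        acts[t] = [a + gap * d for a, d in zip(acts[t], concepts[c])]
        changes.append((t, c))
    return changes, refusal_score(acts, readout, token_weights)

# Hypothetical toy data: 3 tokens, 3-dimensional activations, 3 concept directions.
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
readout = [1.0, 0.5, -0.5]
token_weights = [0.2, 0.3, 0.5]
jailbreak_acts = [[-0.5, 0.0, 0.0], [-0.5, -0.4, 0.0], [-0.8, 0.0, 0.4]]
refused_acts = [[0.5, 0.0, 0.0], [0.5, 0.4, 0.0], [0.8, 0.0, -0.4]]

changes, final_score = greedy_patch(
    jailbreak_acts, refused_acts, concepts, readout, token_weights)
print("patched (token, concept) pairs:", changes)
print("final toy refusal score:", round(final_score, 4))
```

In this toy setup the list of patched (token, concept) pairs plays the role of the explanation: it names which concepts at which token positions had to change for the score to flip. Because the gradient is only a local estimate, the sketch recomputes it after every applied patch; a greedy loop of this kind is not guaranteed to find the smallest possible set in general.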

Best for: Research Scientist, AI Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.