Activation-Guided Local Editing for Jailbreaking Attacks

2024-05-13 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Researchers from Beihang University and other institutions introduce Activation-GuIded Local Editing (AGILE), a two-stage jailbreaking framework designed to bypass the safety alignment of Large Language Models (LLMs). AGILE combines elements of token-level and prompt-level attacks and reports up to a 37.74% gain in Attack Success Rate (ASR) over the strongest baseline. In the first stage, a generator LLM creates a scenario-based dialogue context and rephrases the malicious query to obscure its intent. In the second stage, the target model's hidden states and attention scores guide fine-grained text edits, such as synonym substitution and token injection, that steer the model's internal representation of the prompt from malicious toward benign. Experiments on the HarmBench dataset across six open-source LLMs (e.g., Llama-3-8B-Instruct, Qwen-2.5-7B-Instruct) and four closed-source LLMs (e.g., GPT-4o, Claude-3.5-Sonnet) demonstrate AGILE's effectiveness and transferability relative to the baselines, even against defense mechanisms such as Llama-Guard.
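The summary does not say whether the 37.74% figure is an absolute difference in ASR (percentage points) or a relative improvement, and the two can diverge widely. The sketch below shows how ASR and both kinds of gain are typically computed. The outcome lists are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
def attack_success_rate(outcomes):
    """Return ASR as a percentage: successful attempts / total attempts."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def absolute_gain(asr_new, asr_base):
    """Gain in percentage points (simple difference of two ASRs)."""
    return asr_new - asr_base


def relative_gain(asr_new, asr_base):
    """Gain as a percentage of the baseline ASR."""
    if asr_base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (asr_new - asr_base) / asr_base


# Illustrative judge verdicts only (True = attempt judged successful).
baseline_outcomes = [True] * 40 + [False] * 60   # 40 of 100
method_outcomes = [True] * 70 + [False] * 30     # 70 of 100

asr_base = attack_success_rate(baseline_outcomes)  # 40.0
asr_new = attack_success_rate(method_outcomes)     # 70.0

print(absolute_gain(asr_new, asr_base))  # 30.0 percentage points
print(relative_gain(asr_new, asr_base))  # 75.0 percent relative to baseline
```

When comparing reported gains across papers, it helps to check which of the two definitions is in use before drawing conclusions.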

Key takeaway

For red-teaming specialists and AI safety researchers, AGILE demonstrates that current LLM safety mechanisms remain vulnerable to sophisticated, activation-guided text manipulation. You should prioritize developing defenses that are robust against attacks leveraging internal model states and semantic obfuscation, rather than relying solely on surface-level prompt filtering or simple contextual cues. Focus on deeper, more resilient alignment techniques that can detect subtle shifts in intent, even when prompts appear benign.

Key insights

AGILE is a two-stage jailbreaking framework that uses activation-guided text editing to bypass LLM safety alignments.

Principles

Semantic structure of the final prompt is critical for attack success.
Pushing hidden states towards a "benign" subspace increases jailbreak success.
Attention scores identify optimal token editing positions.
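As a toy illustration of the last principle, the sketch below ranks token positions by how much attention they receive. It assumes attention weights for a single layer are already available as an array of shape (heads, seq, seq), averages over heads, and reads the row for the final position. How AGILE actually aggregates attention across heads and layers is not described in this summary, so that choice and the function name are assumptions.

```python
import numpy as np


def top_attended_positions(attn, k=2):
    """Return the k positions that receive the most attention from the
    final position, averaged over heads.

    attn: array of shape (heads, seq, seq) holding attention weights.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 3 or attn.shape[1] != attn.shape[2]:
        raise ValueError("attn must have shape (heads, seq, seq)")
    last_row = attn.mean(axis=0)[-1]               # final position -> every position
    order = np.argsort(-last_row, kind="stable")   # descending by weight
    return order[:k].tolist()


# Toy weights: 2 heads, 4 positions; only the final row is filled in.
attn = np.zeros((2, 4, 4))
attn[:, -1, :] = [[0.1, 0.5, 0.3, 0.1],
                  [0.2, 0.4, 0.3, 0.1]]

print(top_attended_positions(attn, k=2))  # [1, 2]
```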

Method

AGILE first uses a generator LLM to produce a scenario-based dialogue context and to rephrase the query. An editing module then applies attention-guided synonym substitution and token injection, selecting edits that shift the target model's hidden states toward the benign region.

In practice

Use a generator LLM to create deceptive dialogue contexts.
Employ attention scores to pinpoint high-impact tokens for editing.
Train MLP classifiers to guide hidden state manipulation.
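The classifier mentioned in the last item can be pictured as a small probe over hidden-state vectors. The same kind of probe is also a common building block for monitoring a model's internal states. The sketch below trains a one-hidden-layer MLP with plain NumPy on synthetic Gaussian vectors that stand in for hidden states. The paper's actual features, layer choice, architecture, and training setup are not given in this summary, so every name and hyperparameter here is an assumption.

```python
import numpy as np


def make_synthetic_states(n_per_class=200, dim=8, seed=0):
    """Synthetic stand-ins for hidden states: two Gaussian clusters."""
    rng = np.random.default_rng(seed)
    benign = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, dim))
    flagged = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, dim))
    X = np.vstack([benign, flagged])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y


def forward(params, X):
    """Return hidden activations and P(label == 1) for each row of X."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
    return H, p


def train_mlp_probe(X, y, hidden=16, lr=0.1, epochs=500, seed=0):
    """Full-batch gradient descent on mean binary cross-entropy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        H, p = forward((W1, b1, W2, b2), X)
        g = (p - y) / n                       # d(loss)/d(logit)
        gH = np.outer(g, W2) * (1.0 - H**2)   # backprop through tanh
        W2 -= lr * (H.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
    return W1, b1, W2, b2


X, y = make_synthetic_states()
params = train_mlp_probe(X, y)
_, probs = forward(params, X)
accuracy = float(((probs > 0.5) == (y == 1)).mean())
print(f"training accuracy on synthetic data: {accuracy:.3f}")
```

On real hidden states the classes would not be this cleanly separable, so a held-out split and calibration check would be needed before trusting the probe's scores.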

Topics

Activation-Guided Editing
LLM Jailbreaking
Adversarial Transferability
Hidden State Guidance
Large Language Models

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.