Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
Summary
A novel jailbreak method, Attention-Guided Reward (AGR), targets Large Reasoning Models (LRMs) and achieves higher attack success rates (ASR) than existing methods. The research finds that successful LRM jailbreaks correlate with lower attention to harmful tokens in the input prompt but higher attention to them in the reasoning content. AGR exploits this finding with a reinforcement learning (RL) framework whose reward function explicitly optimizes for this attention pattern, and it expands the RL action space with diverse persuasion strategies. Experiments on five LRMs (Qwen3-1.7B, Qwen3-8B, DeepSeek-R1-Distill-Llama-8B, o4-mini, Gemini-2.5-Flash) across three benchmarks (AdvBench, StrongReject, HarmBench) show that AGR outperforms existing methods in effectiveness (up to 98.0% ASR), efficiency (1.55-1.71 Average Successful Turns), and transferability, while remaining robust against defenses such as SmoothLLM and Llama-Guard-3.
Key takeaway
For AI Security Engineers assessing LRM vulnerabilities, this research describes an attack vector that works through the model's internal attention behavior. Prioritize monitoring attention patterns within LRMs, particularly the combination of low attention to harmful tokens in the input prompt and high attention to them in the reasoning content. Build defenses that target these internal reasoning dynamics, since input-perturbation defenses and external safety classifiers (SmoothLLM and Llama-Guard-3 in the paper's experiments) are less effective against AGR's attention-guided prompt refinements. Red-teaming with attention-aware methods should be part of any LRM security assessment.
Key insights
Successful LRM jailbreaks correlate with specific attention patterns, which can be optimized via RL to achieve higher attack success rates.
Principles
- Lower attention to harmful tokens in the input prompt correlates with jailbreak success.
- Higher attention to harmful tokens in the reasoning content correlates with jailbreak success.
- Attention patterns serve as strong discriminative signals for jailbreak outcomes.
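To make the idea of an attention proportion concrete, the following minimal sketch computes the share of attention mass that falls on flagged tokens, once for the prompt and once for the reasoning content. It assumes the proportion is simply flagged attention divided by total attention within a segment; the summary does not give the paper's exact definition, and the function name `attention_proportion` and all numbers below are hypothetical.

```python
from typing import Sequence


def attention_proportion(weights: Sequence[float], flagged: Sequence[bool]) -> float:
    """Share of the total attention mass that lands on flagged token positions.

    Assumed definition for illustration only; not taken from the paper.
    """
    if len(weights) != len(flagged):
        raise ValueError("weights and flagged must have the same length")
    total = sum(weights)
    if total <= 0:
        return 0.0  # no attention mass at all, so no share to report
    return sum(w for w, f in zip(weights, flagged) if f) / total


# Toy attention weights (hypothetical values, not measured from any model).
prompt_weights = [0.30, 0.25, 0.05, 0.40]
prompt_flagged = [False, False, True, False]

reasoning_weights = [0.10, 0.50, 0.30, 0.10]
reasoning_flagged = [False, True, True, False]

ap_p = attention_proportion(prompt_weights, prompt_flagged)        # prompt side
ap_r = attention_proportion(reasoning_weights, reasoning_flagged)  # reasoning side
print(f"AP_p={ap_p:.2f}, AP_r={ap_r:.2f}")
```

In a real setting the weights would come from a model's attention maps, aggregated over heads and layers in whatever way the analysis chooses; that aggregation is left out here because the summary does not specify it.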
Method
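An RL framework optimizes prompt transformations using an attention-guided reward function. The reward is derived from a linear SVM trained on two attention proportions: the share of attention on harmful tokens in the input prompt ($AP_p$) and in the reasoning content ($AP_r$). The action space contains 17 actions, including cognitive persuasion strategies.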
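As a rough illustration of how a linear SVM over two attention features can yield a scalar score, the sketch below fits a linear separator on synthetic ($AP_p$, $AP_r$) pairs and uses the signed decision value $w \cdot x + b$ as the score. Everything here is an assumption made for illustration: the data are invented, the trainer is a plain subgradient-descent stand-in written in pure Python, and the paper's actual reward formula is not given in this summary.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,
    lam: float = 0.01,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit (w, b) by stochastic subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge term only contributes when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed decision value w.x + b, used here as a hypothetical scalar score."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Synthetic (AP_p, AP_r) pairs; label +1 marks the low-AP_p / high-AP_r pattern.
X = [(0.05, 0.80), (0.10, 0.70), (0.08, 0.75), (0.12, 0.65),
     (0.60, 0.10), (0.55, 0.20), (0.70, 0.15), (0.50, 0.25)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
pos_scores = [score(w, b, x) for x, label in zip(X, y) if label == 1]
neg_scores = [score(w, b, x) for x, label in zip(X, y) if label == -1]
print("lowest +1 score:", min(pos_scores))
print("highest -1 score:", max(neg_scores))
```

On this toy data every point labeled +1 scores above every point labeled -1, so the decision value orders the two attention patterns as the principles above describe. The same two-feature score could equally serve a defender as a monitoring signal for flagging suspicious attention patterns.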
In practice
- Use $AP_p$ and $AP_r$ to quantify jailbreak-related attention.
- Employ diverse persuasion strategies to expand the RL action space.
- Train a linear SVM on attention patterns to provide the reward signal.
Topics
- Large Reasoning Models
- Jailbreak Attacks
- Reinforcement Learning
- Attention Mechanisms
- AI Security
- Adversarial Prompts
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.