Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

2026-05-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Attention-Guided Reward (AGR) is a jailbreak method for Large Reasoning Models (LRMs) that reports higher attack success rates (ASR) than existing methods. The research finds that successful LRM jailbreaks correlate with lower attention to harmful tokens in the input prompt but higher attention to them in the reasoning content. AGR exploits this finding with a reinforcement learning (RL) framework whose reward function explicitly optimizes for this attention pattern, and it expands the RL action space with diverse persuasion strategies. Experiments on five LRMs (Qwen3-1.7B, Qwen3-8B, DeepSeek-R1-Distill-Llama-8B, o4-mini, Gemini-2.5-Flash) across three benchmarks (AdvBench, StrongReject, HarmBench) show AGR outperforming existing methods in effectiveness (up to 98.0% ASR), efficiency (1.55 to 1.71 Average Successful Turns), and transferability, while remaining robust against defenses such as SmoothLLM and Llama-Guard-3.

Key takeaway

For AI Security Engineers assessing LRM vulnerabilities, this research highlights a critical new attack vector. Prioritize monitoring attention patterns within LRMs, particularly the combination of low attention to harmful tokens in the input prompt and high attention to them in the reasoning content. Build defenses that target these internal reasoning dynamics, because traditional input filters and external safety classifiers are less effective against AGR's stealthy, attention-guided prompt refinements. Proactive red-teaming with attention-aware methods is now essential.

Key insights

Successful LRM jailbreaks correlate with specific attention patterns, which can be optimized via RL for higher attack success rates.

Principles

Lower attention to harmful tokens in the input prompt is associated with jailbreak success.
Higher attention to harmful tokens in the reasoning content is associated with jailbreak success.
Attention patterns serve as strong discriminative signals for jailbreak outcomes.

Method

An RL framework optimizes prompt transformations using an attention-guided reward function derived from a linear SVM trained on attention proportions for the prompt ($AP_p$) and the reasoning content ($AP_r$). The action space has 17 actions, including cognitive persuasion strategies.
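To make the attention proportions concrete, here is a minimal sketch of one plausible way to compute them. The summary does not give the paper's exact definition, so the formula below (share of attention mass on harmful-token positions within a segment) is an assumption, as are the function name `attention_proportion` and the toy weights.

```python
from typing import Sequence


def attention_proportion(weights: Sequence[float], harmful_idx: Sequence[int]) -> float:
    """Share of a segment's attention mass that falls on harmful-token positions.

    Assumed definition for illustration; the paper may aggregate over layers
    and heads differently.
    """
    total = sum(weights)
    if total <= 0:
        return 0.0
    harmful = set(harmful_idx)
    return sum(w for i, w in enumerate(weights) if i in harmful) / total


# Toy attention weights (hypothetical): 5 prompt tokens, 4 reasoning tokens.
prompt_attn = [0.30, 0.05, 0.40, 0.05, 0.20]
reasoning_attn = [0.10, 0.50, 0.30, 0.10]

# Positions flagged as harmful in each segment (hypothetical labels).
ap_p = attention_proportion(prompt_attn, harmful_idx=[1, 3])
ap_r = attention_proportion(reasoning_attn, harmful_idx=[1, 2])

print(f"AP_p = {ap_p:.2f}, AP_r = {ap_r:.2f}")
```

In this toy case $AP_p$ is low and $AP_r$ is high, which is the pattern the summary associates with successful jailbreaks and therefore the pattern a monitoring tool would want to flag.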

In practice

Use $AP_p$ and $AP_r$ to quantify jailbreak-related attention.
Employ diverse persuasion strategies to expand the RL action space.
Train a linear SVM on attention patterns to produce the reward signal.
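The linear-SVM step in the list above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the paper's implementation: the training points are synthetic, the hinge-loss subgradient trainer `train_linear_svm` and the scoring helper `svm_score` are hypothetical names, and using the signed decision value as the reward is a guess at how a reward could be derived from the classifier. In practice a library such as scikit-learn's `LinearSVC` would be the usual choice.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (AP_p, AP_r)


def train_linear_svm(X: Sequence[Point], y: Sequence[int], lr: float = 0.1,
                     lam: float = 0.01, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in zip(X, y):
            margin = label * (w[0] * x1 + w[1] * x2 + b)
            # Regularization step: shrink the weights slightly.
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]
            # Hinge-loss step: only points inside the margin move the boundary.
            if margin < 1:
                w[0] += lr * label * x1
                w[1] += lr * label * x2
                b += lr * label
    return w, b


def svm_score(w: Sequence[float], b: float, ap_p: float, ap_r: float) -> float:
    """Signed decision value; assumed here to serve as the reward signal."""
    return w[0] * ap_p + w[1] * ap_r + b


# Synthetic (AP_p, AP_r) pairs: +1 = jailbreak succeeded, -1 = failed.
X = [(0.10, 0.80), (0.15, 0.70), (0.20, 0.90), (0.05, 0.75),
     (0.70, 0.20), (0.80, 0.10), (0.60, 0.30), (0.75, 0.15)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
high = svm_score(w, b, ap_p=0.10, ap_r=0.80)
low = svm_score(w, b, ap_p=0.80, ap_r=0.10)
print(f"w = {w}, b = {b:.3f}, score(low AP_p, high AP_r) = {high:.3f}, score(reverse) = {low:.3f}")
```

On data shaped like this, the learned weight on $AP_p$ comes out negative and the weight on $AP_r$ positive, matching the two attention principles stated earlier. The same score could equally be used defensively, as a detector for suspicious attention patterns.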

Topics

Large Reasoning Models
Jailbreak Attacks
Reinforcement Learning
Attention Mechanisms
AI Security
Adversarial Prompts

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.