Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Summary
A new study identifies a novel jailbreak attack, the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, designed to inject harmful content into the step-by-step reasoning chains of Large Reasoning Models (LRMs) while keeping the final answers unchanged. This attack addresses the challenges of manipulating input instructions without altering LRM outputs and consistently bypassing safety mechanisms across diverse questions. The PRJA Framework integrates a Semantic-based Trigger Selection module for identifying manipulative reasoning triggers and a Psychology-based Instruction Generation module. The latter leverages psychological theories like obedience to authority and moral disengagement to craft adaptive instructions, enhancing LRM compliance with harmful content generation. Experiments across five question-answering datasets show PRJA achieves an average attack success rate of 83.6% against commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.
Key takeaway
For research scientists and security engineers developing or deploying Large Reasoning Models, understanding the PRJA Framework is critical. Your current safety evaluations, which often focus solely on final answer integrity, may be insufficient. You should implement new testing protocols that specifically probe the reasoning chains for subtle injections of harmful content, even when final answers appear benign, to mitigate this emerging attack vector.
Key insights
Harmful content can be injected into LRM reasoning steps while preserving final answers, posing a novel safety challenge.
Principles
- Reasoning process safety is distinct from final answer safety.
- Psychological framing enhances LRM compliance with harmful content.
Method
The PRJA Framework uses semantic analysis to select manipulative reasoning triggers and applies psychological theories (obedience to authority, moral disengagement) to generate adaptive instructions for injecting harmful content into LRM reasoning.
In practice
- Target LRM reasoning steps, not just final answers.
- Employ psychological framing in adversarial prompts.
Topics
- Large Reasoning Models
- Reasoning-targeted Jailbreak
- Reasoning Safety
- Psychological Framing
- Semantic Triggers
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.