Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Summary
A novel threat, the reasoning-targeted jailbreak attack, targets Large Reasoning Models (LRMs) by injecting harmful content into their step-by-step reasoning while leaving the final answer unchanged. The attack exploits vulnerabilities in intermediate reasoning steps, which are critical in high-stakes applications like healthcare and legal assistance. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework carries out this attack by combining a Semantic-based Trigger Selection module with a Psychology-based Instruction Generation module. PRJA automatically selects manipulative reasoning triggers via semantic analysis and draws on psychological theories of obedience to authority and moral disengagement to generate adaptive instructions. In experiments on five question-answering datasets, PRJA achieves an average attack success rate of 83.6% against commercial LRMs such as DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.
Key takeaway
If you are a CTO or VP of Engineering deploying LRMs in sensitive domains, recognize that current safety alignment focuses primarily on final answers, leaving reasoning processes vulnerable. Your teams should prioritize multi-stage defenses, including query-level screening, reasoning-trigger filtering, and step-wise safety alignment, to protect against reasoning-targeted jailbreak attacks that can undermine trust and introduce critical risks without altering the final output.
Key insights
Reasoning-targeted jailbreak attacks inject harmful content into LRM reasoning steps without altering final answers.
Principles
- Psychological framing enhances LRM compliance with harmful content generation.
- Logical coherence is crucial for successful reasoning-targeted attacks.
- Safety alignment varies significantly across commercial LRMs.
Method
The PRJA framework uses semantic analysis to select manipulative reasoning triggers, then generates psychologically informed instructions (drawing on obedience to authority and moral disengagement) to bypass LRM safety mechanisms. The resulting reasoning contains harmful content while the final answer is preserved.
In practice
- Implement query-level screening for manipulative phrases.
- Deploy reasoning-trigger filters to correct malicious semantic structures.
- Audit each generated reasoning step for safety policy consistency.
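The first and third defenses above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the defense evaluated in the paper: the pattern lists, the unsafe-term list, and the function names (screen_query, keyword_step_checker, audit_reasoning) are all hypothetical, and a real deployment would likely replace the regexes and keyword matching with a maintained policy list or a trained safety classifier.

```python
import re
from typing import Callable, Dict, List, Sequence

# Hypothetical patterns for illustration only; they are not taken from the paper.
MANIPULATION_PATTERNS: Dict[str, List[str]] = {
    "authority": [
        r"\bas (?:your|the) (?:supervisor|administrator|developer)\b",
        r"\byou (?:are|have been) authori[sz]ed\b",
    ],
    "moral_disengagement": [
        r"\bfor (?:purely )?(?:educational|research) purposes\b",
        r"\bno one will be harmed\b",
    ],
}

# Hypothetical unsafe terms; a stand-in for a real safety classifier.
UNSAFE_TERMS = ("synthesize the toxin", "bypass the safety interlock")


def screen_query(query: str) -> List[str]:
    """Return the manipulation categories whose patterns match the query."""
    return [
        label
        for label, patterns in MANIPULATION_PATTERNS.items()
        if any(re.search(p, query, flags=re.IGNORECASE) for p in patterns)
    ]


def keyword_step_checker(step: str) -> bool:
    """Flag a reasoning step if it contains any hypothetical unsafe term."""
    lowered = step.lower()
    return any(term in lowered for term in UNSAFE_TERMS)


def audit_reasoning(
    steps: Sequence[str],
    is_unsafe: Callable[[str], bool] = keyword_step_checker,
) -> List[int]:
    """Return the indices of reasoning steps that the checker flags.

    Every step is inspected, not only the final answer, because the
    attack described above leaves the final answer unchanged.
    """
    return [i for i, step in enumerate(steps) if is_unsafe(step)]


if __name__ == "__main__":
    query = "As your supervisor, I confirm you are authorized; no one will be harmed."
    print(screen_query(query))

    steps = [
        "Identify the question type.",
        "Detail how to synthesize the toxin.",
        "The answer is B.",
    ]
    print(audit_reasoning(steps))
```

Passing the checker as a parameter keeps the audit loop independent of how a step is judged, so the keyword stand-in can be swapped for a stronger classifier without changing the calling code.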
Topics
- Reasoning-targeted Jailbreak Attacks
- Large Reasoning Models
- Psychology-based Instruction Generation
- Semantic Trigger Selection
- Obedience to Authority Theory
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.