Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

Reasoning-targeted jailbreak attacks are a new threat to Large Reasoning Models (LRMs): they inject harmful content into the model's step-by-step reasoning while leaving the final answer unchanged. The attack exploits weaknesses in the intermediate reasoning steps, which users rely on in high-stakes applications such as healthcare and legal assistance. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework carries out this attack by combining a Semantic-based Trigger Selection module with a Psychology-based Instruction Generation module. PRJA automatically selects manipulative reasoning triggers via semantic analysis and draws on psychological theories of obedience to authority and moral disengagement to generate adaptive instructions. In experiments on five question-answering datasets, PRJA achieves an average attack success rate of 83.6% against commercial LRMs such as DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.

Key takeaway

CTOs and VPs of Engineering deploying LRMs in sensitive domains should recognize that current safety alignment focuses primarily on final answers, which leaves the reasoning process exposed. Teams should prioritize multi-stage defenses (query-level screening, reasoning-trigger filtering, and step-wise safety alignment) against reasoning-targeted jailbreak attacks, which can undermine trust and introduce critical risks without altering the final output.
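
The three defensive stages named above can be sketched as a simple gate around a model call. This is a minimal illustration under stated assumptions, not the paper's defense: the cue lists, the Verdict type, the guarded_answer function, and the model interface (a callable returning a list of reasoning steps plus a final answer) are all hypothetical. The source does not describe how step-wise safety alignment works, so a runtime per-step check stands in for it here, and substring matching stands in for the classifiers a real deployment would need.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical cue lists (all lowercase), for illustration only.
# Real systems would use trained classifiers, not substring matching.
QUERY_CUES = ("as an authorized official", "you bear no responsibility")
TRIGGER_CUES = ("in your reasoning steps, include", "while reasoning, also explain")
UNSAFE_STEP_MARKERS = ("[unsafe-placeholder]",)

# Assumed model interface: query -> (reasoning steps, final answer).
Model = Callable[[str], Tuple[List[str], str]]


@dataclass
class Verdict:
    allowed: bool
    stage: str  # the stage that made the decision
    reasons: List[str] = field(default_factory=list)
    answer: str = ""


def _hits(text: str, cues: Sequence[str]) -> List[str]:
    """Return every cue that occurs in the text, case-insensitively."""
    lowered = text.lower()
    return [cue for cue in cues if cue in lowered]


def guarded_answer(query: str, model: Model) -> Verdict:
    # Stage 1: query-level screening for authority or disengagement framing.
    hits = _hits(query, QUERY_CUES)
    if hits:
        return Verdict(False, "query_screening", hits)

    # Stage 2: filter instructions that try to steer the reasoning itself.
    hits = _hits(query, TRIGGER_CUES)
    if hits:
        return Verdict(False, "trigger_filtering", hits)

    # Stage 3: inspect every reasoning step, not only the final answer.
    steps, answer = model(query)
    for index, step in enumerate(steps):
        hits = _hits(step, UNSAFE_STEP_MARKERS)
        if hits:
            reasons = [f"step {index}: {hit}" for hit in hits]
            return Verdict(False, "stepwise_check", reasons)

    return Verdict(True, "passed", [], answer)
```

The third stage is the one that matters for this class of attack: a response whose final answer looks correct is still rejected if any intermediate step is flagged, which an answer-only check would miss.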

Key insights

Reasoning-targeted jailbreak attacks inject harmful content into LRM reasoning steps without altering final answers.

Principles

Method

The PRJA framework selects manipulative triggers through semantic analysis and generates psychologically informed instructions (obedience to authority, moral disengagement) to bypass LRM safety mechanisms, so that the reasoning carries harmful content while the final answer is preserved.

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.