Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

A novel threat, the reasoning-targeted jailbreak attack, targets Large Reasoning Models (LRMs) by injecting harmful content into their step-by-step reasoning while leaving the final answer unchanged. The attack exploits vulnerabilities in intermediate reasoning steps, which carry particular weight in high-stakes applications such as healthcare and legal assistance. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework mounts this attack by combining a Semantic-based Trigger Selection module with a Psychology-based Instruction Generation module. PRJA automatically selects manipulative reasoning triggers via semantic analysis and draws on psychological theories of obedience to authority and moral disengagement to generate adaptive instructions. Experiments on five question-answering datasets show an average attack success rate of 83.6% against commercial LRMs such as DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.

Key takeaway

For CTOs and VPs of Engineering deploying LRMs in sensitive domains: current safety alignment focuses primarily on final answers, which leaves the reasoning process exposed. Your teams should prioritize multi-stage defenses, including query-level screening, reasoning-trigger filtering, and step-wise safety alignment, to protect against reasoning-targeted jailbreak attacks that can undermine trust and introduce critical risks without altering the final output.

Key insights

Reasoning-targeted jailbreak attacks inject harmful content into LRM reasoning steps without altering final answers.

Principles

Psychological framing enhances LRM compliance with harmful content generation.
Logical coherence is crucial for successful reasoning-targeted attacks.
Safety alignment varies significantly across commercial LRMs.

Method

The PRJA framework uses semantic analysis to select manipulative reasoning triggers, then generates psychologically informed instructions (based on obedience to authority and moral disengagement) to bypass LRM safety mechanisms. The result is harmful content in the reasoning steps while the final answer is preserved.

In practice

Implement query-level screening for manipulative phrases.
Deploy reasoning-trigger filters to correct malicious semantic structures.
Audit each generated reasoning step for safety policy consistency.
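
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the defense evaluated in the paper: the phrase lists, regular expressions, and function names are all hypothetical placeholders, and the trigger filter simply drops a suspicious sentence rather than correcting it. A real deployment would likely use trained classifiers and a maintained safety policy instead of fixed keywords.

```python
import re
from dataclasses import dataclass, field

# Hypothetical cue lists, for illustration only (not taken from the paper).
AUTHORITY_CUES = ("as your supervisor", "you are required to", "by order of")
DISENGAGEMENT_CUES = ("for the greater good", "no one will be harmed")

# Hypothetical pattern: a sentence that tells the model what to put in its reasoning.
TRIGGER_PATTERN = re.compile(
    r"[^.!?]*\b(in|during) your reasoning\b[^.!?]*[.!?]?", re.IGNORECASE
)

# Hypothetical stand-in for a step-level safety policy check.
UNSAFE_STEP_PATTERNS = (re.compile(r"\bbypass\b.*\bsafety\b", re.IGNORECASE),)


@dataclass
class ScreeningReport:
    query_flags: list = field(default_factory=list)    # matched manipulative cues
    flagged_steps: list = field(default_factory=list)  # indices of unsafe steps

    @property
    def is_clean(self) -> bool:
        return not self.query_flags and not self.flagged_steps


def screen_query(query: str) -> list:
    """Stage 1: return the manipulative phrases found in the incoming query."""
    lowered = query.lower()
    return [cue for cue in AUTHORITY_CUES + DISENGAGEMENT_CUES if cue in lowered]


def strip_reasoning_triggers(query: str) -> str:
    """Stage 2: drop sentences that try to steer the reasoning trace."""
    return TRIGGER_PATTERN.sub("", query).strip()


def audit_steps(steps: list) -> list:
    """Stage 3: return indices of reasoning steps that match an unsafe pattern."""
    return [
        i for i, step in enumerate(steps)
        if any(p.search(step) for p in UNSAFE_STEP_PATTERNS)
    ]


def review(query: str, steps: list) -> ScreeningReport:
    """Combine query screening with a step-by-step audit of the reasoning."""
    return ScreeningReport(screen_query(query), audit_steps(steps))
```

In this sketch, review() would run on the query plus the reasoning trace returned by the model, while strip_reasoning_triggers() would be applied to the query before it reaches the model. Auditing every step, rather than only the final answer, is the point: the attack described here leaves the final answer untouched.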

Topics

Reasoning-targeted Jailbreak Attacks
Large Reasoning Models
Psychology-based Instruction Generation
Semantic Trigger Selection
Obedience to Authority Theory

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.