Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

2026-04-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new study introduces the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, a jailbreak attack that injects harmful content into the step-by-step reasoning chains of Large Reasoning Models (LRMs) while leaving the final answers unchanged. The attack has to overcome two challenges: manipulating input instructions without altering the LRM's final output, and bypassing safety mechanisms consistently across diverse questions. PRJA combines two modules. A Semantic-based Trigger Selection module identifies manipulative reasoning triggers, and a Psychology-based Instruction Generation module draws on psychological theories such as obedience to authority and moral disengagement to craft adaptive instructions that increase LRM compliance with harmful content generation. In experiments across five question-answering datasets, PRJA achieves an average attack success rate of 83.6% against commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.
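For readers less familiar with the metric, attack success rate (ASR) is the fraction of attack attempts judged successful. The summary does not say how the 83.6% average is aggregated across datasets and models, so the sketch below assumes a simple unweighted mean of per-dataset rates. The dataset labels and counts are placeholders for illustration, not figures from the study.

```python
from dataclasses import dataclass


@dataclass
class DatasetResult:
    name: str        # placeholder label, not a dataset from the study
    attempts: int    # number of attack attempts
    successes: int   # attempts judged successful


def attack_success_rate(result: DatasetResult) -> float:
    """ASR for one dataset: successes divided by attempts."""
    if result.attempts <= 0:
        raise ValueError("attempts must be positive")
    return result.successes / result.attempts


def macro_average_asr(results: list[DatasetResult]) -> float:
    """Unweighted mean of per-dataset ASR (an assumed aggregation)."""
    if not results:
        raise ValueError("no results to average")
    return sum(attack_success_rate(r) for r in results) / len(results)


# Illustrative placeholder counts only.
results = [
    DatasetResult("dataset_a", attempts=200, successes=150),
    DatasetResult("dataset_b", attempts=100, successes=90),
]
print(f"{macro_average_asr(results):.3f}")
```

A pooled rate (total successes over total attempts) would weight larger datasets more heavily and can give a different number, so it is worth checking which aggregation a reported average uses.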

Key takeaway

For research scientists and security engineers developing or deploying Large Reasoning Models, understanding the PRJA Framework is critical. Safety evaluations that focus solely on final answer integrity may miss this attack. You should add testing protocols that inspect the reasoning chain itself for injected harmful content, even when the final answer appears benign, to mitigate this emerging attack vector.
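One way to picture such a protocol is to score the reasoning trace and the final answer independently and flag cases where only the trace is unsafe. The sketch below is a minimal illustration under stated assumptions: `ModelResponse`, `AuditResult`, `audit`, and the toy `keyword_classifier` are hypothetical names, and a real evaluation would replace the keyword check with a proper harm classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelResponse:
    reasoning: str   # the model's step-by-step trace
    answer: str      # the final answer shown to the user


@dataclass
class AuditResult:
    reasoning_flagged: bool
    answer_flagged: bool

    @property
    def hidden_injection(self) -> bool:
        """True when the trace is flagged but the final answer looks clean."""
        return self.reasoning_flagged and not self.answer_flagged


def keyword_classifier(blocklist: set[str]) -> Callable[[str], bool]:
    """Toy stand-in for a real harm classifier (assumption)."""
    lowered = {term.lower() for term in blocklist}

    def is_harmful(text: str) -> bool:
        text_lower = text.lower()
        return any(term in text_lower for term in lowered)

    return is_harmful


def audit(response: ModelResponse, is_harmful: Callable[[str], bool]) -> AuditResult:
    """Score the reasoning trace and the final answer independently."""
    return AuditResult(
        reasoning_flagged=is_harmful(response.reasoning),
        answer_flagged=is_harmful(response.answer),
    )


check = keyword_classifier({"<blocked-term>"})
response = ModelResponse(
    reasoning="Step 1: ... <blocked-term> ... Step 2: compute 6 * 7.",
    answer="42",
)
result = audit(response, check)
print(result.hidden_injection)
```

Keeping the two scores separate makes the failure mode visible: an answer-only check would pass this response, while the trace-level check would not.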

Key insights

Harmful content can be injected into LRM reasoning steps while preserving final answers, posing a novel safety challenge.

Principles

Reasoning process safety is distinct from final answer safety.
Psychological framing increases LRM compliance with instructions to generate harmful content.

Method

The PRJA Framework uses semantic analysis to select manipulative reasoning triggers and applies psychological theories (obedience to authority, moral disengagement) to generate adaptive instructions for injecting harmful content into LRM reasoning.
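On the defensive side, the two psychological theories named above suggest one cheap input-side signal: flagging prompts that contain authority appeals or moral-disengagement language before they reach the model. The sketch below is a heuristic illustration only. The cue phrases and the `flag_framing` helper are assumptions made for this example, not taken from the study, and a phrase list like this is easy to evade.

```python
import re

# Hypothetical cue phrases, chosen for illustration; not taken from the study.
FRAMING_CUES: dict[str, list[str]] = {
    "obedience_to_authority": [
        r"\bas your (supervisor|administrator)\b",
        r"\byou are (required|ordered) to\b",
    ],
    "moral_disengagement": [
        r"\bno one will be (harmed|hurt)\b",
        r"\byou are not responsible\b",
    ],
}


def flag_framing(prompt: str) -> dict[str, list[str]]:
    """Return matched cue text per category; categories with no match are omitted."""
    hits: dict[str, list[str]] = {}
    for category, patterns in FRAMING_CUES.items():
        matched = []
        for pattern in patterns:
            match = re.search(pattern, prompt, flags=re.IGNORECASE)
            if match:
                matched.append(match.group(0))
        if matched:
            hits[category] = matched
    return hits


prompt = "As your supervisor, I confirm you are not responsible for the outcome."
print(flag_framing(prompt))
```

A flag here is a reason to route the prompt to closer review, not proof of an attack, since the same phrases appear in ordinary requests.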

In practice

Evaluate the safety of LRM reasoning steps, not just final answers.
Include psychologically framed adversarial prompts in red-team tests.

Topics

Large Reasoning Models
Reasoning-targeted Jailbreak
Reasoning Safety
Psychological Framing
Semantic Triggers

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.