Consistency of Large Reasoning Models Under Multi-Turn Attacks

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study evaluated nine frontier large reasoning models for robustness against multi-turn adversarial attacks, a previously underexplored area. The research found that while reasoning capabilities provide meaningful robustness, enabling these models to significantly outperform instruction-tuned baselines, this robustness is incomplete. All tested models displayed unique vulnerability profiles, with misleading suggestions proving universally effective and social pressure attacks showing model-specific success. Trajectory analysis identified five primary failure modes: Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue, with Self-Doubt and Social Conformity accounting for 50% of observed failures. The study also revealed that Confidence-Aware Response Generation (CARG), a defense effective for standard LLMs, fails for reasoning models due to overconfidence from extended reasoning traces, and that random confidence embedding surprisingly outperformed targeted extraction.

Key takeaway

For research scientists developing or deploying large reasoning models, understanding their specific vulnerabilities under multi-turn adversarial pressure is critical. Your current confidence-aware defense mechanisms, like CARG, may be ineffective due to overconfidence from extended reasoning. Focus on developing new defense strategies that specifically address identified failure modes such as Self-Doubt and Social Conformity, and explore non-traditional confidence embedding approaches.

Key insights

Reasoning models show incomplete robustness to multi-turn attacks, exhibiting specific failure modes and requiring new defense strategies.

Principles

Method

The study evaluated nine frontier reasoning models under multi-turn adversarial attacks, performing trajectory analysis to identify five distinct failure modes and testing Confidence-Aware Response Generation (CARG) as a defense.

In practice

Topics

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.