THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
Summary
THRD is a training-free framework designed to defend Large Language Models (LLMs) against multi-turn jailbreak attacks, which exploit conversational dynamics like gradual escalation. Unlike existing defenses that either require costly retraining or perform single-turn analysis, THRD explicitly models temporal risk accumulation. The framework integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals. Experiments against state-of-the-art multi-turn attacks demonstrate THRD's effectiveness, reducing the Attack Success Rate (ASR) to 0.2-4.0% while maintaining LLM utility with less than 1.5% degradation on MMLU and GSM8K benchmarks. Analysis indicates over 70% of multi-turn attacks are detected at Turn 2 or later, validating THRD's multi-turn approach.
Key takeaway
AI security engineers and machine learning engineers defending LLMs against multi-turn jailbreak attacks should integrate defense frameworks that explicitly model temporal risk accumulation. Single-turn analysis alone is insufficient: over 70% of multi-turn attacks are detected at Turn 2 or later. A multi-module approach like THRD reduced ASR to 0.2-4.0% with less than 1.5% degradation on MMLU and GSM8K, which makes it a reasonable candidate for conversational AI safety.
Key insights
Safety in multi-turn LLM interactions is trajectory-dependent, requiring defenses that model temporal risk accumulation.
Principles
- Dialogue history continuously reshapes LLM context.
- Single-turn analysis is insufficient for multi-turn attacks.
- Explicit temporal aggregation is crucial for detection.
Method
THRD integrates TRA, HCA, RE, and a Decision Module using a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment to detect multi-turn jailbreaks.
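The summary does not give THRD's actual formulas, so the following is only a minimal sketch of what a time-evolving score with attenuation and a trend adjustment could look like. Everything in it is an assumption made for illustration: the names TurnSignals, RiskTracker and flag_turns, the equal-weight average of the three module scores, the exponential decay, and the values of decay, trend_weight and threshold.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TurnSignals:
    """Per-turn scores in [0, 1]; hypothetical stand-ins for TRA, HCA and RE outputs."""
    turn_risk: float
    history_risk: float
    response_risk: float


@dataclass
class RiskTracker:
    decay: float = 0.7         # assumed attenuation factor for older evidence
    trend_weight: float = 0.5  # assumed weight on turn-over-turn increase
    threshold: float = 0.5     # assumed blocking threshold
    accumulated: float = 0.0
    previous: Optional[float] = None

    def update(self, s: TurnSignals) -> Tuple[float, bool]:
        # Combine the three module scores (assumption: plain average).
        instant = (s.turn_risk + s.history_risk + s.response_risk) / 3
        # Attenuation: earlier turns fade, recent turns count more.
        self.accumulated = self.decay * self.accumulated + (1 - self.decay) * instant
        # Trend: add a bonus only when risk rose since the previous turn.
        trend = 0.0 if self.previous is None else max(0.0, instant - self.previous)
        self.previous = instant
        score = min(1.0, self.accumulated + self.trend_weight * trend)
        return score, score >= self.threshold


def flag_turns(signals: List[TurnSignals]) -> List[bool]:
    """Run a fresh tracker over one conversation; True marks a flagged turn."""
    tracker = RiskTracker()
    return [tracker.update(s)[1] for s in signals]


# Illustrative, made-up scores for a gradually escalating conversation.
escalating = [TurnSignals(r, r, r) for r in (0.1, 0.3, 0.6, 0.9)]
print(flag_turns(escalating))
```

With these made-up inputs, only the fourth turn crosses the threshold, while a conversation that stays flat at a low score is never flagged. The point of the sketch is that the decision depends on the trajectory of scores rather than on any single turn.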
In practice
- Implement multi-module defense for LLM safety.
- Prioritize temporal risk accumulation in LLM security.
- Validate multi-turn detection beyond Turn 1.
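One way to check detection beyond Turn 1 is to record, for each attack conversation, the first turn at which the defense flagged it, and then compute the share of detections at Turn 2 or later. The sketch below assumes that bookkeeping already exists. The function name late_detection_share and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Optional


def late_detection_share(first_flag_turns: Iterable[Optional[int]]) -> float:
    """Fraction of detected attacks whose first flag came at turn 2 or later.

    Each entry is the 1-based turn of the first flag, or None if the attack
    was never detected. Undetected attacks are excluded from the denominator.
    """
    detected = [t for t in first_flag_turns if t is not None]
    if not detected:
        return 0.0
    late = sum(1 for t in detected if t >= 2)
    return late / len(detected)


# Made-up example: five attack conversations, one never detected.
sample = [1, 2, 3, None, 4]
print(late_detection_share(sample))
```

A high share here suggests that a single-turn filter would have missed most of what the multi-turn defense caught. Track undetected attacks separately, since this metric ignores them.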
Topics
- Multi-turn Jailbreak Attacks
- Large Language Models
- Training-Free Defense
- Temporal Risk Accumulation
- Attack Success Rate
- Conversational AI Security
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.