DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

DoubtProbe is a dual-branch, inference-time defense framework against black-box jailbreaks of large language models (LLMs). It targets a specific weakness: many jailbreaks reorganize harmful information rather than remove it, which lets them evade safety alignment. DoubtProbe frames black-box jailbreak defense as consistency checking under controlled transformation and combines structural verification with semantic auditing. The structural branch extracts a structured representation from the original request, reconstructs the request under constraints, and flags information-preservation failures. In parallel, the semantic branch audits the original prompt directly. In benchmarks against existing defenses, DoubtProbe showed a stronger and more stable defense-utility trade-off. On Qwen2.5-72B, it reduced the attack success rate on JBB from 0.293 to 0.100 and on CodeAttack from 0.152 to 0.001, while keeping false positive rates low: 0.022 on AlpacaEval and 0.016 on OR-Bench. The gains remained stable when transferred to Llama-3.1-70B, which suggests that structural inconsistency signals are a practical and generalizable basis for defense.

Key takeaway

For AI security engineers deploying user-facing LLMs, black-box jailbreak defense is critical. Consider integrating a dual-branch inference-time defense such as DoubtProbe, which combines structural verification with semantic auditing. The approach detects harmful content that has been reorganized rather than removed: on Qwen2.5-72B it cut the JBB attack success rate from 0.293 to 0.100 while keeping false positive rates at or below 0.022, and the gains held on Llama-3.1-70B. Prioritize defenses that check for information-preservation failures and also audit the original prompt semantically.

Key insights

DoubtProbe defends against black-box jailbreaks by detecting structural inconsistencies and semantically auditing prompts during LLM inference.

Principles

Black-box jailbreaks often reorganize harmful goals.
Consistency checking can detect reorganized harmful content.
Structural inconsistency signals generalize across LLM backbones.

Method

DoubtProbe uses a dual-branch approach: structural verification extracts and reconstructs requests to detect information-preservation failures, while semantic auditing directly checks the original prompt.
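
The two branches described above can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the names `dual_branch_check`, `preserved_fraction`, and `Verdict`, the word-overlap measure, the 0.6 threshold, and the rule that either branch alone can block a request are all assumptions made for the example. In a real system the `extract`, `reconstruct`, and `audit` callables would presumably be backed by an LLM; here they are toy stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Verdict:
    blocked: bool
    reason: str


def preserved_fraction(original: str, rebuilt: str) -> float:
    """Share of the original's distinct words that survive reconstruction.

    Assumption: word overlap is only a stand-in for whatever
    consistency measure the paper actually uses.
    """
    orig_words = set(original.lower().split())
    if not orig_words:
        return 1.0
    return len(orig_words & set(rebuilt.lower().split())) / len(orig_words)


def dual_branch_check(
    prompt: str,
    extract: Callable[[str], Any],
    reconstruct: Callable[[Any], str],
    audit: Callable[[str], bool],
    min_preserved: float = 0.6,  # hypothetical threshold, not from the paper
) -> Verdict:
    # Structural branch: extract a structured form, rebuild the request
    # from it, and flag the prompt if too much information was lost.
    rebuilt = reconstruct(extract(prompt))
    if preserved_fraction(prompt, rebuilt) < min_preserved:
        return Verdict(True, "structural: information-preservation failure")
    # Semantic branch: audit the original prompt directly.
    if audit(prompt):
        return Verdict(True, "semantic: audit flagged the prompt")
    return Verdict(False, "passed both branches")


# Toy stand-ins: a faithful reconstruction passes, a lossy one is flagged.
faithful = dual_branch_check(
    "Summarize the plot of Hamlet",
    extract=lambda p: {"request": p},
    reconstruct=lambda s: s["request"],
    audit=lambda p: False,
)
lossy = dual_branch_check(
    "Summarize the plot of Hamlet",
    extract=lambda p: {"request": p},
    reconstruct=lambda s: "Summarize",
    audit=lambda p: False,
)
print(faithful.blocked, lossy.blocked)  # False True
```

Passing the three stages in as callables keeps the sketch independent of any particular model or API. The branches run sequentially here for simplicity; nothing in the logic prevents running them in parallel.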

In practice

Combine structural and semantic checks for robust defense.
Evaluate defense stability across different LLM backbones.
Focus on information-preservation failures in prompts.
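
To compare a defense across backbones, the two metrics quoted in the summary can be computed from per-prompt outcomes. The sketch below is an assumption about how such an evaluation might be organized, and the function names are hypothetical. The only real figures used are the Qwen2.5-72B numbers reported above.

```python
from typing import Sequence


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Fraction of adversarial prompts that got past the defense."""
    return sum(attack_succeeded) / len(attack_succeeded)


def false_positive_rate(benign_blocked: Sequence[bool]) -> float:
    """Fraction of benign prompts that the defense wrongly blocked."""
    return sum(benign_blocked) / len(benign_blocked)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in a rate, e.g. attack success with vs. without defense."""
    return (before - after) / before


# Reported Qwen2.5-72B attack success rates, before and after the defense.
print(round(relative_reduction(0.293, 0.100), 3))  # 0.659 (JBB)
print(round(relative_reduction(0.152, 0.001), 3))  # 0.993 (CodeAttack)
```

Running the same computation for each backbone, and tracking the false positive rate alongside it, shows whether a defense gain is stable or specific to one model.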

Topics

Black-Box Jailbreak Defense
Large Language Models
Structural Verification
Semantic Auditing
LLM Security
Qwen2.5-72B

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.