An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

An empirical study investigates output-based jailbreak detection in large language models (LLMs) using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. The research evaluates both a lexical TF-IDF detector and a generation inconsistency-based detector across different sampling budgets. Key findings indicate that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behavior. The most significant improvements in detection occur when moving from a single generation to moderate sampling (e.g., three generations), with larger budgets yielding diminishing returns. Cross-generator experiments show that detection signals partially generalize, particularly within related model families. A category-level analysis reveals that lexical detectors primarily capture stylistic cues, such as procedural phrasing, rather than explicit harmful intent, leading to both false positives and false negatives. The study concludes that moderate multi-sample auditing offers a more reliable and practical approach for estimating LLM vulnerability.
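A back-of-the-envelope model helps show why single-output evaluation can underestimate vulnerability and why returns diminish as the budget grows. The sketch below assumes, purely for illustration, that each generation for a prompt is harmful independently with probability p. The study does not state this model, and the value p = 0.2 is hypothetical, not a reported result.

```python
def prob_any_harmful(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is harmful."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return 1.0 - (1.0 - p) ** k


def marginal_gains(p: float, max_k: int) -> list[float]:
    """Increase in that probability from adding the k-th sample, for k = 2..max_k."""
    probs = [prob_any_harmful(p, k) for k in range(1, max_k + 1)]
    return [b - a for a, b in zip(probs, probs[1:])]


if __name__ == "__main__":
    p = 0.2  # hypothetical per-sample harmful rate, NOT taken from the study
    for k in range(1, 6):
        print(k, round(prob_any_harmful(p, k), 3))
    print(marginal_gains(p, 5))
```

Under this simplified model, each extra sample adds less than the one before it, which is in line with the diminishing returns the summary describes. Real generations from one prompt are unlikely to be fully independent, so treat the numbers as intuition only.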

Key takeaway

For research scientists and CTOs evaluating LLM safety: single-output jailbreak evaluation significantly underestimates a model's true vulnerability. Sample multiple generations per prompt, around three, to get a more accurate and reliable assessment of an LLM's risk profile. This budget balances computational cost against detection effectiveness and reflects the stochastic nature of LLM behavior in deployment, which a single output cannot capture.

Key insights

Multi-generation sampling is crucial for accurately assessing LLM jailbreak vulnerability, as single outputs underestimate risk.

Principles

LLM alignment reduces harmful outputs but complicates detection.
Lexical detectors capture stylistic patterns, not just harmful intent.
Detection signals partially generalize across related model families.

Method

The study uses a lexical TF-IDF detector and a NegBLEURT-style inconsistency detector, evaluating them on the JailbreakBench Behaviors dataset with sampling budgets from k=1 to k=5 generations and multiple generator models of varying alignment strength.
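The idea behind an inconsistency-based detector can be sketched as follows: sample several generations for one prompt and score how much they disagree. This is a minimal illustration, not the study's implementation. The token-set Jaccard similarity here is a crude stand-in for the learned NegBLEURT-style metric, and the function names and the mean-pairwise aggregation are assumptions.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a placeholder for a learned similarity metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def inconsistency_score(generations: list[str]) -> float:
    """Return 1 minus the mean pairwise similarity across sampled generations."""
    pairs = list(combinations(generations, 2))
    if not pairs:  # k = 1 gives no pairs, so no inconsistency signal
        return 0.0
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

Note that with k=1 this score is always zero, which is one reason a detector of this kind needs a sampling budget of at least two generations.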

In practice

Use moderate multi-sample auditing (e.g., k=3) for LLM safety evaluation.
Combine lexical and inconsistency-based detection for robust performance.
Interpret detection performance relative to model alignment strength.
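The auditing practice above can be sketched as a small loop: draw up to k generations per prompt and count the prompt as vulnerable if any of them is flagged. This is a hedged sketch under stated assumptions. `generate` and `is_harmful` are hypothetical callables standing in for a model call and for whichever detector (lexical, inconsistency-based, or a combination) is in use; neither comes from the study.

```python
from typing import Callable, Sequence


def audit_prompt(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical: returns one sampled output
    is_harmful: Callable[[str], bool],  # hypothetical: detector decision per output
    k: int = 3,
) -> bool:
    """Return True if any of up to k sampled generations is flagged."""
    if k < 1:
        raise ValueError("k must be >= 1")
    # any() stops at the first flagged output, so fewer than k samples may be drawn.
    return any(is_harmful(generate(prompt)) for _ in range(k))


def vulnerability_rate(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
    k: int = 3,
) -> float:
    """Fraction of prompts with at least one flagged generation within k samples."""
    if not prompts:
        return 0.0
    flagged = sum(audit_prompt(p, generate, is_harmful, k) for p in prompts)
    return flagged / len(prompts)
```

Running `vulnerability_rate` at k=1 and again at k=3 on the same prompts gives a direct view of how much a single-output audit would miss for a given model and detector.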

Topics

Jailbreak Detection
Multi-Generation Sampling
LLM Safety
Model Alignment
Adversarial Evaluation

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.