ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

2026-05-04 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

ContextualJailbreak is a black-box red-teaming strategy that bypasses large language model (LLM) safety alignment through evolutionary search over simulated multi-turn primed dialogues. A two-level judge assigns a graded 0-5 harm score, so partially harmful responses still inform and guide the search. The strategy uses five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic; the last two are new contributions. Across 50 HarmBench behaviors, ContextualJailbreak achieved a 100% Attack Success Rate (ASR) on gpt-oss:20B, qwen3-8B, and llama3.1:70B, and 90% on gpt-oss:120B, outperforming four baseline methods by 31-96 percentage points. In addition, 40 maximally harmful attacks found against gpt-oss:120B transferred to closed frontier models, reaching 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, with lower success rates on Claude models.

Key takeaway

For research scientists and security engineers focused on LLM safety, ContextualJailbreak demonstrates that multi-turn conversational priming is a highly effective attack vector. Integrate evolutionary search and graded harm scoring into your red-teaming workflows to identify subtle vulnerabilities. Attack transferability varies significantly across LLM providers, so test each model you deploy or evaluate separately.

Key insights

Evolutionary red-teaming with multi-turn conversational priming effectively jailbreaks LLMs and reveals that alignment robustness differs across model providers.

Principles

Graded harm scores improve red-teaming search.
Multi-turn priming outperforms single-turn attacks.
Attack transferability varies across LLM providers.

Method

ContextualJailbreak uses evolutionary search over simulated multi-turn dialogues, guided by a 0-5 harm score from a two-level judge, employing five mutation operators to generate effective jailbreaks.
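
The search loop described above can be sketched generically. This is a minimal illustration of score-guided evolutionary search with elitist selection, not the paper's implementation: the function names (`evolve`, `toy_score`, `toy_mutate`), the population size, the generation budget, and the toy string-matching objective are all assumptions chosen so the example runs offline. In a real harness the `score` callable would be the judge and `mutate` would apply one of the mutation operators.

```python
import random
from typing import Callable, List, Tuple

Candidate = str
Scored = Tuple[int, Candidate]


def evolve(
    seed: Candidate,
    mutate: Callable[[Candidate, random.Random], Candidate],
    score: Callable[[Candidate], int],
    max_score: int = 5,
    population_size: int = 8,
    generations: int = 500,
    rng_seed: int = 0,
) -> Scored:
    """Return the best (score, candidate) pair found by the search."""
    rng = random.Random(rng_seed)
    population: List[Scored] = [(score(seed), seed)]
    for _ in range(generations):
        if max(population)[0] >= max_score:
            break  # a top-scoring candidate already exists
        children: List[Scored] = []
        for _ in range(population_size):
            _, parent = rng.choice(population)
            child = mutate(parent, rng)
            children.append((score(child), child))
        # Elitist selection: keep the highest-scoring candidates, so a
        # partial score (say 2 out of 5) survives and can be refined.
        population = sorted(population + children, reverse=True)[:population_size]
    return max(population)


# Toy stand-ins (hypothetical, for illustration only).
TARGET = "prime"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def toy_score(candidate: Candidate) -> int:
    """Graded 0-5 score: number of positions that match TARGET."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def toy_mutate(candidate: Candidate, rng: random.Random) -> Candidate:
    """Replace one randomly chosen character with a random letter."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


if __name__ == "__main__":
    best_score, best = evolve("zzzzz", toy_mutate, toy_score)
    print(best_score, best)
```

Because selection keeps the top-scoring candidates each generation, the best score never decreases. A binary success/failure signal would return 0 for every candidate short of the target, leaving the search with nothing to climb; the graded score is what makes intermediate progress visible.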

In practice

Test LLMs with multi-turn contextual priming.
Implement graded scoring for red-teaming.
Analyze provider-specific alignment robustness.
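
To compare robustness across providers, graded judge scores can be aggregated into a per-model Attack Success Rate. The sketch below is one way to do that under stated assumptions: the success threshold of 5 (counting only maximally harmful responses), the function names, and the placeholder model names and scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def attack_success_rate(scores: List[int], threshold: int = 5) -> float:
    """Percentage of attempts whose graded score meets the threshold.

    Assumes scores on the 0-5 scale; the threshold is an assumption.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 0 or s > 5 for s in scores):
        raise ValueError("scores must lie in the range 0-5")
    successes = sum(s >= threshold for s in scores)
    return 100.0 * successes / len(scores)


def asr_by_model(results: Dict[str, List[int]], threshold: int = 5) -> Dict[str, float]:
    """Compute the success rate separately for each model."""
    return {model: attack_success_rate(s, threshold) for model, s in results.items()}


if __name__ == "__main__":
    # Placeholder scores for two hypothetical models, not real results.
    illustrative = {
        "model-a": [5, 5, 3, 0],
        "model-b": [5, 2, 1, 0],
    }
    print(asr_by_model(illustrative))
```

As a check on the arithmetic, with 40 transferred attacks a rate of 90.0% corresponds to 36 successes and 70.0% to 28. Keeping the raw graded scores alongside the thresholded rate also lets you see how close the failed attempts came.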

Topics

ContextualJailbreak
Evolutionary Red-Teaming
LLM Jailbreak Attacks
Contextual Priming
HarmBench Evaluation

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.