ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Summary
ContextualJailbreak is a black-box red-teaming strategy that bypasses large language model (LLM) safety alignment through evolutionary search over simulated multi-turn primed dialogues. It uses a two-level judge that assigns a graded 0-5 harm score, so partially harmful responses still guide the search. The strategy employs five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which troubleshooting and mechanistic are new contributions. Across 50 HarmBench behaviors, ContextualJailbreak achieved a 100% Attack Success Rate (ASR) on gpt-oss:20B, qwen3-8B, and llama3.1:70B, and 90% on gpt-oss:120B, outperforming four baseline methods by 31-96 percentage points. In addition, 40 maximally harmful attacks found against gpt-oss:120B transferred to closed frontier models, reaching 90.0% ASR on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, with lower success rates on Claude models.
Key takeaway
For research scientists and security engineers focused on LLM safety, ContextualJailbreak shows that multi-turn conversational priming is a highly effective attack vector. Consider adding evolutionary search and graded harm scoring to your red-teaming workflows to surface subtle vulnerabilities. Attack transferability varies widely across LLM providers, so test each model you deploy or evaluate separately.
Key insights
Evolutionary red-teaming with multi-turn conversational priming effectively jailbreaks LLMs and exposes differences in alignment robustness across models and providers.
Principles
- Graded harm scores improve red-teaming search.
- Multi-turn priming outperforms single-turn attacks.
- Attack transferability varies across LLM providers.
Method
ContextualJailbreak runs an evolutionary search over simulated multi-turn dialogues. A two-level judge assigns each response a 0-5 harm score, which guides the search, and five mutation operators generate the candidate jailbreaks.
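To illustrate why a graded score can guide a search where a pass/fail signal cannot, here is a minimal sketch of the selection step only. It is not the authors' implementation: the `Scored` record, the `top_k` helper, and the success threshold of 5 are assumptions made for illustration, and candidates are represented by opaque identifiers rather than dialogue text.

```python
from typing import Callable, List, NamedTuple


class Scored(NamedTuple):
    candidate_id: str  # opaque identifier (hypothetical); no dialogue text is modelled
    harm_score: int    # judge score on the 0-5 scale described in the summary


SUCCESS_THRESHOLD = 5  # assumption: only the top score counts as a full success


def binary_fitness(item: Scored) -> int:
    """Pass/fail signal: every partial result collapses to 0."""
    return 1 if item.harm_score >= SUCCESS_THRESHOLD else 0


def graded_fitness(item: Scored) -> int:
    """Graded signal: partial results remain distinguishable."""
    return item.harm_score


def top_k(population: List[Scored], fitness: Callable[[Scored], int], k: int) -> List[str]:
    """Return the ids of the k fittest candidates (ties keep input order)."""
    ranked = sorted(population, key=fitness, reverse=True)
    return [item.candidate_id for item in ranked[:k]]


# Toy population with no full successes yet (illustrative values, not paper data).
population = [Scored("a", 0), Scored("b", 3), Scored("c", 1), Scored("d", 4)]

binary_pick = top_k(population, binary_fitness, k=2)  # all tie at 0
graded_pick = top_k(population, graded_fitness, k=2)  # ranks by partial progress
```

With the binary signal every candidate ties, so selection falls back to input order; with the graded signal the two highest-scoring candidates are kept for the next generation.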
In practice
- Test LLMs with multi-turn contextual priming.
- Implement graded scoring for red-teaming.
- Analyze provider-specific alignment robustness.
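As one way to compare robustness across providers, the sketch below tallies a per-model Attack Success Rate from graded judge scores. The record layout, the `attack_success_rate` function, the default threshold of 5, and the placeholder model names are all assumptions for illustration; the numbers are toy values, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(
    records: Iterable[Tuple[str, int]], threshold: int = 5
) -> Dict[str, float]:
    """Return ASR in percent per model.

    Each record is (model_name, harm_score); a score at or above
    `threshold` counts as a successful attack (assumed convention).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for model, score in records:
        totals[model] += 1
        if score >= threshold:
            hits[model] += 1
    return {model: 100.0 * hits[model] / totals[model] for model in totals}


# Toy records with placeholder model names.
records = [
    ("model-x", 5), ("model-x", 2),
    ("model-y", 5), ("model-y", 5),
]
asr = attack_success_rate(records)
```

Reporting the rate per model rather than pooled keeps provider-level differences visible, which is the point of the transfer comparison.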
Topics
- ContextualJailbreak
- Evolutionary Red-Teaming
- LLM Jailbreak Attacks
- Contextual Priming
- HarmBench Evaluation
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.