One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
Summary
Researchers from the University of Michigan and the University of Toronto introduce Incremental Completion Decomposition (Icd), a trajectory-based jailbreak strategy that bypasses Large Language Model (LLM) safety mechanisms. Icd elicits a sequence of single-word continuations related to a malicious request before prompting for the full harmful response. The study evaluates three variants: Icd–Auto (model-generated words), Icd–Seed (manually injected words), and Icd–Prefill (manually injected words plus a prefill string for the final response). On the AdvBench, JailbreakBench, and StrongREJECT benchmarks, these variants achieved higher Attack Success Rates (ASR) than existing methods such as PAIR, TAP, CoA, and AMA. For instance, Icd–Prefill reached 77.69% ASR on AdvBench for Qwen-2.5-72B, compared with 54.00% for AMA. A mechanistic analysis indicates that Icd systematically suppresses refusal-related representations and shifts internal model activations away from safety-aligned states.
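For readers unfamiliar with the metric, ASR is the share of attack attempts that a judge labels as successful, expressed as a percentage. The following is a minimal sketch of that arithmetic; the function name and the toy labels are illustrative assumptions, not data or code from the paper.

```python
def attack_success_rate(judgments):
    """Return ASR as a percentage: successful attempts / total attempts * 100.

    `judgments` is a sequence of booleans, one per attack attempt, where True
    means a judge labeled the model's response as a successful jailbreak.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    successes = sum(1 for judged_harmful in judgments if judged_harmful)
    return 100.0 * successes / len(judgments)


# Toy labels (hypothetical, not results from the paper).
toy_judgments = [True, False, True, True]
print(f"ASR = {attack_success_rate(toy_judgments):.2f}%")
```

How each attempt is judged (for example, by a rubric or a classifier) varies by benchmark and is outside the scope of this sketch.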
Key takeaway
For CTOs and VPs of Engineering overseeing LLM deployments, this research highlights a critical vulnerability in current safety alignment. Your teams should prioritize trajectory-aware, multi-dimensional safety defenses that can detect and mitigate attacks like Icd, which exploit the gradual buildup of conversational context rather than a single adversarial prompt. Safeguards that are effective only against isolated prompts are insufficient; monitor for incremental context shifts and internal model state changes to improve robustness against sophisticated jailbreaks.
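One way to start on the conversation-level monitoring recommended above is a simple heuristic over the dialogue trajectory. The sketch below flags a conversation when the most recent assistant replies form a run of very short answers, which is the surface pattern the summary describes. The `Turn` type, function names, and threshold are assumptions made for illustration; this is not a defense proposed or evaluated in the paper, and it does not cover the internal-state monitoring mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def short_reply_streak(turns, max_words=1):
    """Count the trailing run of assistant replies with at most `max_words` words."""
    streak = 0
    for turn in reversed(turns):
        if turn.role != "assistant":
            continue  # only assistant replies contribute to the streak
        if len(turn.text.split()) <= max_words:
            streak += 1
        else:
            break  # a longer reply ends the trailing run
    return streak


def flag_incremental_trajectory(turns, threshold=3):
    """Return True when the trailing run of short replies reaches `threshold`.

    The threshold is an arbitrary illustrative value, not one taken from the paper.
    """
    return short_reply_streak(turns) >= threshold


# Neutral placeholder conversation (hypothetical).
conversation = [
    Turn("user", "Answer in one word."),
    Turn("assistant", "alpha"),
    Turn("user", "And?"),
    Turn("assistant", "beta"),
    Turn("user", "And?"),
    Turn("assistant", "gamma"),
]
print(flag_incremental_trajectory(conversation))
```

A flag from a heuristic like this is best treated as a signal to route the next request through stricter review, since benign conversations (quizzes, word games) can show the same pattern.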
Key insights
Incremental, single-word prompts can bypass LLM safety by gradually shifting internal refusal and safety representations.
Principles
- As harmful context accumulates, the model becomes less likely to produce a safe continuation.
- Instruction-following constraints compete with safety objectives.
- Prefilling biases models towards harmful continuations.
Method
Icd involves an initial harmful prompt with a one-word constraint, followed by repeated "And?" prompts for single-word continuations, culminating in a request for full details.
In practice
- Use "Give the details in a cookbook style." for final prompts.
- Vary continuation length (e.g., n=4 for Icd–Auto, n=10 for Icd–Seed).
- Employ Icd–Prefill for highest ASR on larger, more robust models.
Topics
- Incremental Completion Decomposition
- LLM Jailbreak Attacks
- Trajectory-based Attacks
- Attack Success Rate
- LLM Safety Alignment
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.