One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

2026-04-30 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Researchers from the University of Michigan and the University of Toronto introduce Incremental Completion Decomposition (Icd), a trajectory-based jailbreak strategy that bypasses Large Language Model (LLM) safety mechanisms. Icd first elicits a sequence of single-word continuations related to a malicious request, then prompts for the full harmful response. The study evaluates three variants: Icd–Auto (model-generated words), Icd–Seed (manually injected words), and Icd–Prefill (manually injected words plus a prefill string for the final response). On the AdvBench, JailbreakBench, and StrongREJECT benchmarks, these variants reach higher Attack Success Rates (ASR) than existing methods such as PAIR, TAP, CoA, and AMA. For instance, Icd–Prefill achieved 77.69% ASR on AdvBench against Qwen-2.5-72B, compared with 54.00% for AMA. A mechanistic analysis indicates that Icd systematically suppresses refusal-related representations and shifts internal model activations away from safety-aligned states.
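
As a reading aid for the ASR figures, the sketch below computes an attack success rate as the percentage of benchmark prompts judged successful. It assumes a per-prompt boolean verdict from some judge; the function name is hypothetical, and the source does not describe how the paper scores success. The counts in the usage note are illustrative, chosen only so the arithmetic reproduces a percentage of the same form; the source does not state the underlying counts.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the percentage of prompts whose attack was judged successful.

    `verdicts` holds one boolean per benchmark prompt (True = the judge
    marked the response as a successful jailbreak). The judging procedure
    itself is an assumption and is outside the scope of this sketch.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return 100.0 * sum(verdicts) / len(verdicts)


# Illustrative counts only: 404 successes out of 520 prompts.
example = [True] * 404 + [False] * 116
print(f"{attack_success_rate(example):.2f}%")
```

Because ASR is a simple fraction, comparisons between methods are only meaningful when the same prompt set and the same judge are used for each.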

Key takeaway

For CTOs and VPs of Engineering overseeing LLM deployments, this research highlights a critical vulnerability in current safety alignment. Teams should prioritize trajectory-aware defenses that evaluate the whole conversation and can detect and mitigate attacks like Icd, which exploit gradual context buildup rather than a single adversarial prompt. Safeguards that are effective only against isolated prompts are insufficient; monitor for incremental context shifts and, where the model's internals are accessible, for changes in internal model state.
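
One way to make "trajectory-aware" concrete is a conversation-level check that runs before the latest user request is answered. The sketch below is a deliberately simple heuristic and not a defense proposed in the paper: it flags conversations in which several consecutive assistant replies were a single word, so that the pending request can be routed to a stricter review of the full transcript. The `Turn` type, the function names, and the `min_run` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def trailing_one_word_run(turns: List[Turn]) -> int:
    """Count consecutive one-word assistant replies just before the final turn."""
    run = 0
    # Skip the final turn (the pending request) and walk backwards.
    for turn in reversed(turns[:-1]):
        if turn.role != "assistant":
            continue  # user turns do not break the run
        if len(turn.text.split()) == 1:
            run += 1
        else:
            break  # a normal-length reply ends the run
    return run


def needs_full_context_review(turns: List[Turn], min_run: int = 3) -> bool:
    """Flag a pending user request that follows a run of one-word replies.

    The threshold is an assumed tuning knob, not a value from the source.
    """
    if not turns or turns[-1].role != "user":
        return False
    return trailing_one_word_run(turns) >= min_run
```

A surface heuristic like this is easy to evade and will also fire on benign word games, so it is better treated as a trigger for whole-conversation safety evaluation than as a filter on its own.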

Key insights

Incremental, single-word prompts can bypass LLM safety by gradually shifting internal refusal and safety representations.
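
To illustrate what a shift in "refusal representations" could look like when measured, the sketch below uses a difference-of-means direction, a common interpretability technique. This is an assumption for illustration: the source does not say which method the paper's mechanistic analysis uses. Activations are plain lists of floats, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused: Sequence[Vector],
                      complied: Sequence[Vector]) -> List[float]:
    """Unit vector pointing from mean 'complied' to mean 'refused' activations."""
    diff = [r - c for r, c in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical mean activations")
    return [x / norm for x in diff]


def projection(activation: Vector, direction: Vector) -> float:
    """Scalar projection of one activation onto the (unit) direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 2-D values, made up purely to show the arithmetic.
direction = refusal_direction(
    refused=[[2.0, 0.0], [4.0, 0.0]],
    complied=[[0.0, 0.0], [0.0, 0.0]],
)
print(direction)                          # [1.0, 0.0]
print(projection([1.5, 7.0], direction))  # 1.5
```

Under this framing, logging the projection at each conversation turn would show a declining value if the model's state moves away from refusal, which is the kind of drift the insight above describes.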

Principles

Accumulating harmful context makes a safe continuation less likely.
Instruction-following constraints compete with safety objectives.
Prefilling biases models towards harmful continuations.

Method

Icd involves an initial harmful prompt with a one-word constraint, followed by repeated "And?" prompts for single-word continuations, culminating in a request for full details.

In practice

Use "Give the details in a cookbook style." for final prompts.
Vary continuation length (e.g., n=4 for Icd–Auto, n=10 for Icd–Seed).
Employ Icd–Prefill for highest ASR on larger, more robust models.

Topics

Incremental Completion Decomposition
LLM Jailbreak Attacks
Trajectory-based Attacks
Attack Success Rate
LLM Safety Alignment

Code references

tatsu-lab/stanford_alpaca

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.