One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Researchers from the University of Michigan and University of Toronto introduce Incremental Completion Decomposition (Icd), a novel trajectory-based jailbreak strategy that bypasses Large Language Model (LLM) safety mechanisms. Icd first elicits a sequence of single-word continuations related to a malicious request, then prompts for the full harmful response. The study evaluates three variants: Icd–Auto (model-generated words), Icd–Seed (manually injected words), and Icd–Prefill (manually injected words plus a prefill string for the final response). Tested on the AdvBench, JailbreakBench, and StrongREJECT benchmarks, the variants achieve higher Attack Success Rates (ASR) than existing methods such as PAIR, TAP, CoA, and AMA. For instance, Icd–Prefill reached 77.69% ASR on AdvBench for Qwen-2.5-72B, 23.69 percentage points above AMA's 54.00%. A mechanistic analysis indicates that Icd systematically suppresses refusal-related representations and shifts internal model activations away from safety-aligned states.

Key takeaway

For CTOs and VPs of Engineering overseeing LLM deployments, this research highlights a critical vulnerability in current safety alignments. Your teams should prioritize developing trajectory-aware, multi-dimensional safety defenses that can detect and mitigate attacks like Icd, which exploit conversational context buildup rather than single adversarial prompts. Relying solely on safeguards effective against isolated prompts is insufficient; implement monitoring for incremental context shifts and internal model state changes to enhance robustness against sophisticated jailbreaks.

Key insights

Incremental, single-word prompts can bypass LLM safety by gradually shifting internal refusal and safety representations.

Method

Icd involves an initial harmful prompt with a one-word constraint, followed by repeated "And?" prompts for single-word continuations, culminating in a request for full details.
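
On the defensive side, the turn pattern described above (a run of one-word replies followed by a request to elaborate) is simple enough to check for at the conversation level. The sketch below is an illustrative heuristic only, not a defense proposed or evaluated in the paper: the `Turn` type, the `flag_incremental_buildup` function, and the `min_run` threshold of 3 are all assumptions introduced here. It inspects surface structure only and does not examine internal model activations, so it would likely miss paraphrased or padded variants.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def flag_incremental_buildup(turns: List[Turn], min_run: int = 3) -> bool:
    """Return True when the latest user turn follows at least `min_run`
    consecutive one-word assistant replies.

    Intended to run before the model answers the latest user turn, so a
    flagged conversation can be routed to a stricter safety check.
    `min_run` is an arbitrary illustrative threshold.
    """
    if not turns or turns[-1].role != "user":
        return False

    run = 0
    # Walk backwards over earlier turns, counting one-word assistant replies
    # until an assistant reply of any other length ends the run.
    for turn in reversed(turns[:-1]):
        if turn.role != "assistant":
            continue
        if len(turn.text.split()) == 1:
            run += 1
        else:
            break
    return run >= min_run


# Placeholder conversation with the same shape as the described pattern.
suspicious = [
    Turn("user", "<initial request constrained to a one-word answer>"),
    Turn("assistant", "alpha"),
    Turn("user", "And?"),
    Turn("assistant", "beta"),
    Turn("user", "And?"),
    Turn("assistant", "gamma"),
    Turn("user", "<request for full details>"),
]

ordinary = [
    Turn("user", "Explain the TCP handshake."),
    Turn("assistant", "A TCP handshake exchanges three segments."),
    Turn("user", "And TLS?"),
]

print(flag_incremental_buildup(suspicious))
print(flag_incremental_buildup(ordinary))
```

A flag here is a signal for further review rather than a verdict: benign conversations such as word games can also produce runs of one-word replies, so a check like this would sit alongside content-level and model-internal safeguards, not replace them.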

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.