Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Summary
This research investigates "jailbreak scaling laws" for large language models (LLMs): how the attack success rate grows with the number of inference-time samples, and how adversarial prompt injection shifts that growth from polynomial to exponential. The authors propose SpinLLM, a theoretical generative model based on spin-glass theory in a replica-symmetry-breaking regime, in which language generations are treated as spin configurations and unsafe outputs correspond to low-energy clusters. From this model they derive analytically, and confirm empirically, that short injected prompts yield power-law scaling of the attack success rate, while longer prompts (which act as a stronger "magnetic field" toward unsafe clusters) yield exponential scaling. Experiments on GPT-4.5 Turbo, Vicuna-7B v1.5, Llama-3-8B-Instruct, and Llama-3.2-3B-Instruct, using the walledai/AdvBench dataset with Mistral-7B-Instruct-v0.3 as the judge, support these predictions: longer injected prompts are associated with higher attack susceptibility and reduced reasoning ability.
Key takeaway
For AI/ML directors evaluating LLM security, this research highlights a critical vulnerability: the attack success rate of jailbreaking can scale exponentially with the number of inference-time samples, especially when the adversarial prompt is long. Prioritize defenses that hold up under both repeated sampling and extended prompt injection, rather than relying solely on initial safety alignment. Consider continuous monitoring for adversarial prompt patterns and dynamic defenses that adapt to evolving attack strategies.
Key insights
Adversarial prompt injection shifts the jailbreak attack success rate from polynomial to exponential scaling in the number of inference-time samples.
Principles
- Longer injected prompts act as a stronger adversarial "magnetic field" toward unsafe clusters.
- Increased adversarial order reduces model reasoning ability.
Method
The authors use a spin-glass-based generative model (SpinLLM) with a "misalignment field" to analyze and predict inference-time attack success rates theoretically, then validate the predictions empirically on LLMs.
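To give a feel for what a field biasing generations toward an unsafe cluster means, here is a deliberately simplified toy. It is not SpinLLM: the spins are independent, with no glassy couplings and no replica symmetry breaking. The function names, the "unsafe" threshold, and the mapping from field strength `h` to prompt length are all assumptions made for illustration only.

```python
import math


def spin_up_prob(h: float, beta: float = 1.0) -> float:
    """P(s_i = +1) for one independent spin with energy -h * s_i (Boltzmann weight)."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * h))


def unsafe_prob(h: float, n_spins: int = 20, min_aligned: int = 16,
                beta: float = 1.0) -> float:
    """Probability that at least `min_aligned` of `n_spins` point along the field.

    Hypothetical stand-in for "the generation lands in the unsafe cluster".
    Independent spins make the count of aligned spins binomial.
    """
    q = spin_up_prob(h, beta)
    return sum(
        math.comb(n_spins, j) * q**j * (1.0 - q) ** (n_spins - j)
        for j in range(min_aligned, n_spins + 1)
    )


# Illustrative field strengths; a stronger field stands in for a longer injection.
for h in (0.0, 0.25, 0.5, 1.0):
    print(f"h={h:.2f}  per-sample unsafe probability={unsafe_prob(h):.4f}")
```

In this toy, a stronger field raises the per-sample probability of landing in the unsafe set, which is the qualitative role the summary attributes to longer injected prompts. The polynomial regime reported in the paper depends on glassy structure that this sketch omits.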
In practice
- Attack success rate can be amplified by multiple inference-time samples.
- Prompt length directly influences jailbreaking effectiveness.
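The amplification from repeated sampling can be made concrete with a short calculation. The sketch below is not the paper's derivation. It assumes independent attempts and compares two hypothetical cases: every prompt shares one per-sample success probability `p`, or `p` varies across prompts following a Beta(a, b) distribution with much of its mass near zero. The function names and parameter values are illustrative assumptions.

```python
import math


def asr_fixed(p: float, k: int) -> float:
    """ASR after k independent samples when every prompt succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k


def asr_beta(a: float, b: float, k: int) -> float:
    """ASR after k samples when the per-prompt success probability is Beta(a, b).

    Uses E[(1 - p)^k] = B(a, b + k) / B(a, b), evaluated via log-gamma for stability.
    """
    log_fail = (
        math.lgamma(b + k) + math.lgamma(a + b)
        - math.lgamma(a + b + k) - math.lgamma(b)
    )
    return 1.0 - math.exp(log_fail)


# Compare how quickly the failure rate (1 - ASR) shrinks as k grows.
for k in (1, 10, 100, 1000):
    fail_fixed = 1.0 - asr_fixed(0.05, k)      # assumed p = 0.05 for all prompts
    fail_beta = 1.0 - asr_beta(0.5, 5.0, k)    # assumed Beta(0.5, 5.0) across prompts
    print(f"k={k:5d}  fixed-p failure={fail_fixed:.3e}  beta-mixture failure={fail_beta:.3e}")
```

With a fixed `p`, the failure rate is (1 - p)^k and falls exponentially in k. With Beta-distributed `p`, it falls roughly like k^(-a) for large k, a power law. This is one simple way the two regimes can appear. The paper attributes the crossover to the strength of the misalignment field, not to this particular mixture.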
Topics
- Jailbreaking LLMs
- Prompt Injection Attacks
- Spin-Glass Theory
- Attack Success Rate Scaling
- Replica Symmetry Breaking
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.