Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

2024-10-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

This research investigates "jailbreak scaling laws" for large language models (LLMs), showing how adversarial prompt injection can shift the growth of attack success rate with the number of inference-time samples from polynomial to exponential. The authors propose a theoretical generative model, SpinLLM, based on spin-glass theory in a replica-symmetry-breaking regime, in which language generations are treated as spin configurations and unsafe outputs correspond to low-energy clusters. From this model they derive analytically, and confirm empirically, that short injected prompts produce power-law scaling of attack success rate, while longer prompts (which act as a stronger "magnetic field" towards unsafe clusters) produce exponential scaling. Experiments on GPT-4.5 Turbo, Vicuna-7B v1.5, Llama-3-8B-Instruct, and Llama-3.2-3B-Instruct, using the walledai/AdvBench dataset with Mistral-7B-Instruct-v0.3 as a judge, support these predictions and show that longer prompt injections correlate with higher attack susceptibility and reduced reasoning ability.

Key takeaway

For AI/ML Directors evaluating LLM security, this research highlights a critical vulnerability: the attack success rate of jailbreaking can increase exponentially with the number of inference-time samples, especially with longer adversarial prompts. You should prioritize robust defense mechanisms that are resilient to both increased sampling and extended prompt injection, rather than relying solely on initial safety alignments. Consider implementing continuous monitoring for adversarial prompt patterns and developing dynamic defenses that adapt to evolving attack strategies.

Key insights

Adversarial prompt injection shifts the scaling of LLM jailbreak success with inference-time samples from polynomial to exponential.

Principles

Longer prompts create stronger adversarial "magnetic fields."
Increased adversarial order reduces model reasoning ability.

Method

A spin-glass-based generative model (SpinLLM) with a "misalignment field" is used to theoretically analyze and predict inference-time attack success rates, validated empirically on LLMs.

In practice

Attack success rate can be amplified by multiple inference-time samples.
Prompt length directly influences jailbreaking effectiveness.
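
The amplification by repeated sampling can be illustrated with a toy calculation. This is a hedged sketch, not the SpinLLM model from the paper: it assumes each of k independent samples is unsafe with some per-prompt probability p, and compares a fixed p (the chance that every sample stays safe shrinks exponentially in k) with a p drawn from a Beta(a, b) distribution across prompts (the same chance shrinks only as a power law, roughly k^(-a)). The function names, the Beta assumption, and the parameter values are all illustrative choices, not taken from the source.

```python
import math


def failure_fixed(p: float, k: int) -> float:
    """Chance that all k independent samples are safe when each one
    is unsafe with the same fixed probability p (toy assumption)."""
    return (1.0 - p) ** k


def failure_beta(a: float, b: float, k: int) -> float:
    """E[(1 - p)^k] when p ~ Beta(a, b) across prompts (toy assumption).
    Closed form: B(a, b + k) / B(a, b), evaluated in log space."""
    log_ratio = (math.lgamma(b + k) + math.lgamma(a + b)
                 - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_ratio)


def loglog_slope(f, k: int) -> float:
    """Local slope of log f against log k, estimated between k and 2k.
    A roughly constant slope indicates power-law decay; a slope that
    keeps growing in magnitude with k indicates faster (e.g. exponential) decay."""
    return math.log(f(2 * k) / f(k)) / math.log(2.0)


if __name__ == "__main__":
    # Attack success rate after k samples is 1 minus the all-safe probability.
    for k in (1, 10, 100, 1000):
        asr_fixed = 1.0 - failure_fixed(0.05, k)
        asr_beta = 1.0 - failure_beta(0.5, 5.0, k)
        print(f"k={k:5d}  ASR fixed p: {asr_fixed:.6f}  ASR Beta-mixed p: {asr_beta:.6f}")
```

In this toy setting, both curves approach full attack success as k grows, but the remaining gap closes at very different speeds. Whether real models fall into one regime or the other is what the paper ties to prompt injection length; the sketch only shows how the two functional forms differ.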

Topics

Jailbreaking LLMs
Prompt Injection Attacks
Spin-Glass Theory
Attack Success Rate Scaling
Replica Symmetry Breaking

Code references

I-Halder/Jailbreak-Scaling-Laws-for-Large-Language-Models

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.