Adaptive Prompt Embedding Optimization for LLM Jailbreaking

2023-03-30 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, medium

Summary

Prompt Embedding Optimization (PEO) is a novel multi-round white-box jailbreak attack designed for aligned Large Language Models (LLMs). Unlike traditional methods that append discrete adversarial suffixes, PEO directly optimizes the continuous embeddings of existing prompt tokens. This approach preserves the visible prompt string exactly, with 0% text change after nearest-token projection, and ensures responses largely remain on topic. PEO integrates continuous embedding-space optimization with structured continuation targets and an adaptive, failure-focused schedule. It demonstrates superior performance against competing white-box attacks like nanoGCG, SPT, and BEAST across two standard harmful-behavior benchmarks (AdvBench and HarmBench text-test), as measured by ASR-Judge. The research also challenges the assumption that perturbing prompt embeddings inherently destroys semantic content.

Key takeaway

For AI/ML security researchers and red teamers evaluating LLM vulnerabilities, PEO demonstrates that direct prompt embedding optimization is a highly effective and stealthy jailbreaking technique. You should consider this method for stress-testing safety alignments, as it outperforms token-appending attacks while maintaining prompt integrity. Your evaluations should prioritize LLM-as-a-judge metrics over simple string heuristics for accurate assessment of harmful content.

Key insights

Optimizing existing prompt embeddings can jailbreak LLMs while preserving visible text and semantic content.

Principles

Direct embedding optimization preserves prompt semantics.
Adaptive, multi-round schedules enhance attack effectiveness.
LLM judges are superior for evaluating harmfulness over string heuristics.

Method

PEO uses continuous embedding-space optimization, structured continuation targets, and an adaptive failure-focused schedule to perturb existing prompt token embeddings, increasing the likelihood of harmful continuations.

In practice

Focus on embedding-space attacks for stealthy jailbreaks.
Implement multi-round optimization for improved success rates.
Utilize LLM-as-a-judge for robust safety evaluation.

Topics

Prompt Embedding Optimization
LLM Jailbreaking
Adversarial Attacks
Safety Alignment
White-box Attacks

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.