Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Persona Attack is a novel memory injection jailbreak method designed to exploit Large Language Models' conversational memory. Unlike traditional single-prompt injections, this technique manipulates the model's context window incrementally, step-by-step. Experiments on several widely used LLMs demonstrate that as these injections accumulate, models increasingly prioritize the injected instructions over their internal safety alignment mechanisms. The attack's success rate, which can reach 95% under specific instruction configurations, varies significantly based on the model's memory implementation and the combination of instructions used.

Key takeaway

For AI Security Engineers evaluating LLM robustness, this research indicates that traditional safety training is insufficient against memory-based jailbreak attacks like Persona Attack. You should prioritize developing and implementing defenses that specifically address incremental context window manipulation and the accumulation of malicious instructions over conversational turns, rather than solely focusing on single-turn prompt injections.

Key insights

Persona Attack exploits LLM conversational memory to bypass safety alignment via incremental instruction injection.

Principles

Method

Persona Attack incrementally injects instructions into an LLM's context window, causing the model to prioritize these over its internal safety alignment mechanisms.

In practice

Topics

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.