FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion
Summary
FragFuse is a novel attack designed to bypass access control mechanisms in large language model (LLM) agents by exploiting their long-term memory operations. The attack uses a temporal channel: prohibited content, which access control would normally block, is fragmented across multiple interactions. These benign-appearing fragments are stored in the agent's long-term memory and later reconstructed through memory retrieval, so the harmful content never appears explicitly in the final user query. FragFuse operates in three stages: identifying rejection-responsive fragments via black-box adaptive querying, injecting these fragments into memory using marker carrier queries, and then retrieving and fusing them with a follow-up attack query. A surrogate-based optimization scheme further tunes fusion instructions and marker designs so that attacks can be generated automatically. Evaluated across four agent settings and three representative access-control mechanisms, FragFuse achieved an 86.3% average bypass success rate and a 41.1% average end-to-end harmful task success rate, with only 4.4% task-success degradation. Existing prompt-injection and perplexity detectors proved ineffective against this attack.
Key takeaway
AI security engineers deploying LLM agents with access control should treat current mechanisms as vulnerable to memory-based query fragmentation attacks such as FragFuse. The prompt-injection and perplexity detectors evaluated against the attack were ineffective, so deployments that rely on such detectors should not assume they are protected. Re-evaluate the agent's security posture with a focus on memory interaction channels, and develop defenses that prevent fragmented harmful content from being injected into, and reconstructed from, long-term memory. The 86.3% average bypass success rate reported in the evaluation makes this a priority.
Key insights
The long-term memory of LLM agents introduces a new attack surface. FragFuse exploits it to bypass access control by injecting fragmented content into memory and reconstructing it at retrieval time.
Principles
- Agent memory operations introduce a temporal attack channel.
- Fragmenting content evades direct access control checks.
- Reconstructing content from memory bypasses explicit query scrutiny.
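The principles above can be seen in a deliberately toy setting. The sketch below is not FragFuse and not a real access-control system: the phrase-based `is_allowed` check, the `BLOCKED_PHRASES` policy, and the placeholder words are all assumptions made for illustration. It only shows why a check applied to each message in isolation can pass content that the same check rejects once memory has placed the pieces side by side.

```python
# Toy illustration only: a hypothetical phrase-based policy,
# standing in for whatever access-control check an agent uses.
BLOCKED_PHRASES = ("alpha bravo",)  # placeholder for a prohibited request


def is_allowed(text: str) -> bool:
    """Return True when no blocked phrase occurs in the text."""
    normalized = " ".join(text.lower().split())
    return not any(phrase in normalized for phrase in BLOCKED_PHRASES)


# Two separate turns, each checked on its own before being stored.
turns = ["alpha", "bravo"]
per_turn = [is_allowed(turn) for turn in turns]

# The same content after retrieval has joined the stored turns.
fused = " ".join(turns)
fused_allowed = is_allowed(fused)

print(per_turn, fused_allowed)
```

In this toy case both turns pass individually while the fused text is rejected, which is the gap a per-query check leaves open when memory contributes to the final context.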
Method
FragFuse identifies rejection-responsive fragments, injects them into memory using marker carrier queries, then retrieves and fuses them via a follow-up attack query. An automated optimization scheme tunes fusion instructions and marker designs.
In practice
- Evaluate agent access controls against memory-based attacks.
- Implement memory sanitization for LLM agent interactions.
- Develop detectors for fragmented harmful content patterns.
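One way to act on the recommendations above is to apply the policy check to what memory will eventually assemble, not only to each incoming message. The following is a minimal sketch under stated assumptions: `GuardedMemory`, `toy_policy`, and the placeholder phrase are hypothetical names invented here, retrieval is simplified to "return every stored entry", and nothing in the source indicates that this approach was evaluated against FragFuse.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def toy_policy(text: str) -> bool:
    """Hypothetical stand-in for a real access-control check."""
    normalized = " ".join(text.lower().split())
    return "alpha bravo" not in normalized


@dataclass
class GuardedMemory:
    """Memory wrapper that checks fused content, not just single messages."""

    policy_check: Callable[[str], bool]
    entries: List[str] = field(default_factory=list)

    def write(self, text: str) -> bool:
        # Sanitize on write: the new text must pass alone and
        # when joined with everything already stored.
        candidate = " ".join(self.entries + [text])
        if not (self.policy_check(text) and self.policy_check(candidate)):
            return False
        self.entries.append(text)
        return True

    def build_context(self, query: str) -> Optional[str]:
        # Check on read: evaluate retrieved memory together with the query.
        # Simplification: every entry is "retrieved".
        context = " ".join(self.entries + [query])
        return context if self.policy_check(context) else None


memory = GuardedMemory(policy_check=toy_policy)
stored_first = memory.write("alpha")             # passes alone
stored_second = memory.write("bravo")            # rejected once joined
context_blocked = memory.build_context("bravo")  # fused context is rejected
context_ok = memory.build_context("charlie")     # unrelated query passes
```

Checking the joined text on every write grows more expensive as memory fills and may reject harmless entries, so a real deployment would likely restrict the check to the entries a retriever actually returns. A simple substring policy like the one assumed here would also be easy to evade; the sketch shows where the check sits, not how strong it needs to be.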
Topics
- LLM Agents
- Access Control Bypass
- Memory Attacks
- Query Fragmentation
- AI Security
- Vulnerability
Best for: CTO, AI Architect, VP of Engineering/Data, AI Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.