When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Summary
Large language model (LLM)-powered multi-agent systems (MAS) face significant safety risks from malicious agents propagating misinformation and manipulating group decisions. Existing embedding-based defenses, which detect and prune suspicious agents based on text embedding separation, are vulnerable to adaptive attacks. Attackers can craft messages whose embeddings lie close to benign ones, circumventing detection. This paper theoretically and empirically validates this failure mode using three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. The analysis reveals a fundamental limitation of embedding-based defenses: they ignore token-level confidence signals (logits) that remain informative even when embeddings are indistinguishable. The authors propose using confidence scores to prune or down-weight messages during MAS communication, demonstrating improved robustness across models, datasets, and communication topologies. They also find that confidence signal effectiveness decays over communication rounds, emphasizing the need for early intervention.
Key takeaway
If you build LLM-powered multi-agent systems, embedding-based defenses alone may be fundamentally vulnerable to near-benign attacks. Integrate token-level confidence signals into your defense mechanisms to complement embedding-based approaches: these internal signals remain informative reliability cues even when message embeddings are indistinguishable from benign ones. Prioritize intervention in early communication rounds, especially in denser topologies, because the confidence signal becomes less discriminative as rounds progress and malicious content spreads.
Key insights
Embedding-based MAS defenses fail when attackers craft near-benign messages, necessitating token-level confidence signals for robustness.
Principles
- Embedding separability is not a robust defense metric.
- Internal model signals complement external text embeddings.
- Early intervention is critical in multi-round communication.
Method
The proposed method uses token-level confidence scores to prune low-confidence messages or reduce their weight during aggregation in multi-agent system communication, especially when embedding-based defenses are compromised.
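A minimal sketch of how confidence-based pruning and down-weighting could look in an aggregation step. This summary does not say how the paper computes confidence, so the scoring rule here (geometric-mean token probability) is an assumption. The names `message_confidence` and `aggregate` and the 0.5 threshold are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict


def message_confidence(token_logprobs):
    """Assumed scoring rule: geometric-mean token probability, exp(mean log-prob).

    Returns a value in (0, 1]; an empty message scores 0.0.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def aggregate(messages, prune_below=0.5):
    """Confidence-weighted vote over agent messages.

    messages: list of (answer, token_logprobs) pairs, one per agent.
    Messages scoring below `prune_below` (an illustrative threshold) are
    dropped; the rest vote with weight equal to their confidence.
    Returns the winning answer, or None if every message was pruned.
    """
    votes = defaultdict(float)
    for answer, logprobs in messages:
        conf = message_confidence(logprobs)
        if conf < prune_below:
            continue  # prune low-confidence messages
        votes[answer] += conf  # down-weight by confidence
    if not votes:
        return None
    return max(votes, key=votes.get)


# Hypothetical round: three agents say "B", two say "A", but two of the
# "B" messages were generated with low token-level confidence.
round_messages = [
    ("A", [-0.05, -0.10]),
    ("A", [-0.20, -0.10]),
    ("B", [-1.50, -2.00]),
    ("B", [-0.30, -0.30]),
    ("B", [-1.20, -1.00]),
]
winner = aggregate(round_messages)
```

In this toy round an unweighted majority vote would pick "B", while the confidence-weighted vote picks "A" because two of the three "B" messages fall below the threshold.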
In practice
- Implement token-level confidence scoring for MAS messages.
- Prioritize early-round intervention in MAS defense.
- Evaluate defense robustness against near-benign attacks.
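For the last item above, one way to measure exposure is to check how many attack messages slip past an embedding-based detector. The sketch below assumes a simple detector that flags messages whose cosine similarity to the benign centroid falls below a threshold; this is a stand-in, not the specific defenses studied in the paper. The helper names, the 0.8 threshold, and the two-dimensional vectors are hypothetical; real embeddings would come from whatever embedding model the system uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def flagged_by_embedding(embedding, benign_embeddings, min_similarity=0.8):
    """Assumed detector: flag a message far from the benign centroid."""
    center = centroid(benign_embeddings)
    return cosine_similarity(embedding, center) < min_similarity


def evasion_rate(attack_embeddings, benign_embeddings, min_similarity=0.8):
    """Fraction of attack messages the detector fails to flag."""
    if not attack_embeddings:
        return 0.0
    missed = sum(
        not flagged_by_embedding(e, benign_embeddings, min_similarity)
        for e in attack_embeddings
    )
    return missed / len(attack_embeddings)


# Toy vectors for illustration only (not data from the paper).
benign = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
near_benign_attack = [0.95, 0.12]  # crafted to sit close to benign messages
obvious_attack = [-0.2, 1.0]       # clearly separated in embedding space

rate = evasion_rate([near_benign_attack, obvious_attack], benign)
```

A high evasion rate on near-benign messages indicates that embedding separation alone is not catching them, which is where a complementary token-level confidence check becomes relevant.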
Topics
- LLM-based Multi-Agent Systems
- Embedding-Based Defenses
- Near-Benign Attacks
- Token-Level Confidence
- Communication Topologies
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.