When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

2026-05-05 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, long

Summary

Large language model (LLM)-powered multi-agent systems (MAS) face significant safety risks from malicious agents that propagate misinformation and manipulate group decisions. Existing embedding-based defenses detect and prune suspicious agents by relying on separation between text embeddings, which leaves them vulnerable to adaptive attacks: an attacker can craft messages whose embeddings lie close to benign ones and so evade detection. The paper demonstrates this failure mode both theoretically and empirically using three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. The analysis points to a fundamental limitation of embedding-based defenses: they ignore token-level confidence signals (logits), which remain informative even when embeddings are indistinguishable. The authors propose using confidence scores to prune or down-weight messages during MAS communication and report improved robustness across models, datasets, and communication topologies. They also find that the effectiveness of confidence signals decays over communication rounds, which underscores the need for early intervention.
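To make the failure mode concrete, here is a toy sketch of an embedding-separation check. It is not the paper's detector or any of its three attacks: the 3-dimensional vectors, the centroid-plus-cosine rule, the `min_similarity` threshold, and all function names are assumptions chosen only to illustrate why a message whose embedding sits near benign ones passes such a filter.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def flag_by_embedding(embedding, benign_centroid, min_similarity=0.9):
    """Hypothetical rule: flag a message if it is far from the benign centroid."""
    return cosine_similarity(embedding, benign_centroid) < min_similarity

# Made-up embeddings of benign messages (illustrative values only).
benign = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]
center = centroid(benign)

blatant_attack = [0.0, 1.0, 0.2]         # far from the benign cluster
near_benign_attack = [0.95, 0.15, 0.05]  # crafted to sit inside the cluster

print(flag_by_embedding(blatant_attack, center))      # flagged
print(flag_by_embedding(near_benign_attack, center))  # not flagged
```

The second message is harmful by construction, yet the check has nothing to separate it from benign traffic, which is the gap the summary describes.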

Key takeaway

If you are a research scientist or engineer building LLM-powered multi-agent systems, your current embedding-based defenses may be fundamentally vulnerable to sophisticated near-benign attacks. Integrate token-level confidence signals into your defense mechanisms alongside embedding-based approaches: these internal signals still carry reliability cues when the text embeddings can no longer separate malicious messages from benign ones. Intervene in the earliest communication rounds, especially in denser topologies, to limit the rapid spread of malicious content while the confidence signal is still discriminative.

Key insights

Embedding-based MAS defenses fail when attackers craft near-benign messages, necessitating token-level confidence signals for robustness.

Principles

Embedding separability is not a robust defense metric.
Internal model signals complement external text embeddings.
Early intervention is critical in multi-round communication.

Method

The proposed method uses token-level confidence scores to prune low-confidence messages or reduce their weight during aggregation in multi-agent system communication, especially when embedding-based defenses are compromised.
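As a rough illustration of pruning versus down-weighting, the sketch below aggregates agent answers by confidence-weighted vote. It assumes each message already carries a confidence score in [0, 1]; the `prune_below` threshold, the weighting rule, and the function name are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def aggregate_answers(messages, prune_below=0.3):
    """Confidence-weighted vote over (answer, confidence) pairs.

    Messages with confidence below `prune_below` are dropped (pruning);
    the rest contribute their confidence as vote weight (down-weighting).
    Returns None if every message was pruned.
    """
    weights = defaultdict(float)
    for answer, confidence in messages:
        if confidence < prune_below:
            continue  # prune: this message does not vote at all
        weights[answer] += confidence
    if not weights:
        return None
    return max(weights, key=weights.get)

# Illustrative round: three low-confidence agents push "B",
# two high-confidence agents answer "A".
messages = [("A", 0.92), ("A", 0.85), ("B", 0.20), ("B", 0.35), ("B", 0.25)]
print(aggregate_answers(messages))
```

A plain majority vote over these messages would pick "B"; with pruning and weighting, the two confident "A" messages outweigh the single surviving "B" message.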

In practice

Implement token-level confidence scoring for MAS messages.
Prioritize early-round intervention in MAS defense.
Evaluate defense robustness against near-benign attacks.
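One possible way to implement the first item is to score each message by the geometric mean of the probabilities the model assigned to its own generated tokens. This is a minimal sketch under that assumption; the summary does not say how the paper defines its confidence score, and the function names and toy logits here are hypothetical.

```python
import math

def token_log_prob(logits, chosen_index):
    """Log-probability of the chosen token under a softmax over `logits`."""
    peak = max(logits)  # subtract the max for numerical stability
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[chosen_index] - log_normalizer

def message_confidence(step_logits, chosen_tokens):
    """Geometric mean of per-token probabilities, a value in (0, 1].

    step_logits: one list of vocabulary logits per generated token.
    chosen_tokens: index of the token actually emitted at each step.
    """
    log_probs = [
        token_log_prob(logits, token)
        for logits, token in zip(step_logits, chosen_tokens)
    ]
    return math.exp(sum(log_probs) / len(log_probs))

# Toy 3-token vocabulary: a peaked distribution versus a nearly flat one.
confident = message_confidence([[5.0, 0.0, 0.0], [6.0, 0.0, 0.0]], [0, 0])
uncertain = message_confidence([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]], [0, 1])
print(confident > uncertain)
```

Averaging in log space keeps long messages from being penalized simply for their length, which is one reason to prefer it over multiplying raw probabilities.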

Topics

LLM-based Multi-Agent Systems
Embedding-Based Defenses
Near-Benign Attacks
Token-Level Confidence
Communication Topologies

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.