What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
Summary
A new class of Human-Perceptible Adversarial Attacks (HPAA) exploits a fundamental perceptual mismatch between human interpretation and LLM-powered content moderation systems. These attacks embed harmful expressions into otherwise benign text using visually salient typographic manipulations, such as strategic spacing, visual emphasis, and spatial arrangement. While humans readily recognize the harmful content, automated moderation systems, which primarily operate on tokenized text and ignore visual cues, largely miss it. Operating in black-box settings with only a small query budget, HPAA automatically generates evasive content without requiring model access or gradient information. Evaluations across multiple datasets and ten deployed moderation systems, including commercial APIs, revealed that generated attacks achieve over 86% human recognition while maintaining detection rates below 1% with just three detector queries. This research exposes a significant blind spot in current LLM-based moderation and highlights the need for systems that align more closely with human perceptual understanding.
Key takeaway
If you are an AI security engineer evaluating LLM-based content moderation, recognize that token-based systems are largely blind to harmful content that has been visually manipulated. Deployed guardrails, including commercial APIs and open-source solutions, are highly susceptible to Human-Perceptible Adversarial Attacks: in the reported evaluation, detection rates stayed below 1% for content that humans recognized as harmful more than 86% of the time. Integrate visual reasoning capabilities into your moderation pipelines, or harmful content will continue to evade them.
Key insights
LLM content moderation systems are vulnerable to visually based adversarial attacks that exploit human perception.
Principles
- Typographic features can bypass token-based moderation.
- Human perception relies on visual cues LLMs miss.
- Black-box attacks are effective with minimal queries.
Method
The attack automatically generates evasive content by strategically combining typographic features (spacing, emphasis, arrangement) to embed harmful expressions, operating in black-box settings with a small query budget.
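The black-box, budget-limited search described above can be illustrated with a minimal red-team test harness. This is a sketch under stated assumptions, not the paper's algorithm: the three transforms, the `toy_detector` stub, and the `search_variant` helper are all hypothetical names chosen for illustration, and the phrase is a harmless placeholder. A real evaluation would wrap your own moderation endpoint in a function with the same `str -> bool` shape.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical typographic transforms, one per feature family named above.
def emphasis(text: str) -> str:
    """Visual emphasis: upper-case the phrase."""
    return text.upper()

def letter_spacing(text: str) -> str:
    """Strategic spacing: put a space between every character."""
    return " ".join(text)

def vertical(text: str) -> str:
    """Spatial arrangement: one character per line."""
    return "\n".join(text)

def toy_detector(text: str, blocked: Sequence[str] = ("placeholder",)) -> bool:
    """Stand-in for a token-based moderator (assumption): flags text only
    when a blocked term appears as a contiguous, case-insensitive substring."""
    lowered = text.lower()
    return any(term in lowered for term in blocked)

def search_variant(
    phrase: str,
    detector: Callable[[str], bool],
    transforms: Sequence[Callable[[str], str]],
    budget: int = 3,
) -> Tuple[Optional[str], int]:
    """Try transforms in order, spending at most `budget` detector queries.
    Returns (first unflagged variant or None, number of queries used)."""
    queries = 0
    for transform in transforms:
        if queries >= budget:
            break
        candidate = transform(phrase)
        queries += 1  # each detector call counts against the budget
        if not detector(candidate):
            return candidate, queries
    return None, queries

variant, used = search_variant(
    "placeholder", toy_detector, [emphasis, letter_spacing, vertical]
)
```

Here the upper-cased variant is still flagged, while the letter-spaced variant passes the stub on the second query. The harness only uses the detector's yes/no answer, which is what makes the setting black-box.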
In practice
- Test moderation systems against visual typographic attacks.
- Integrate visual processing into LLM moderation pipelines.
- Prioritize human-aligned content understanding.
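As a cheap complement to the practices above, a text-only normalization pass can undo some spacing and emphasis tricks before the text reaches a token-based moderator. The sketch below is an assumption for illustration, not something the source proposes, and it does not replace the visual reasoning the source recommends. `normalize_typography` is a hypothetical helper, and its regex heuristic can also merge legitimate runs of single-letter words.

```python
import re
import unicodedata

def normalize_typography(text: str) -> str:
    """Heuristic cleanup of common typographic manipulations (assumption:
    only Unicode compatibility forms, markdown-style emphasis markers, and
    single characters separated by whitespace are handled)."""
    # Fold compatibility characters (e.g. fullwidth letters) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop markdown-style emphasis markers.
    text = re.sub(r"[*_~`]+", "", text)
    # Join runs of three or more single characters separated by whitespace,
    # including newlines, so "b a d" and vertical layouts become one word.
    def join_run(match: re.Match) -> str:
        return re.sub(r"\s+", "", match.group(0))
    return re.sub(r"\b(?:\w\s+){2,}\w\b", join_run, text)

cleaned = normalize_typography("this is p l a c e h o l d e r text")
```

One reasonable design is to moderate both the raw and the normalized text and flag the content if either is flagged, so that normalization never hides something the moderator would otherwise have caught.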
Topics
- Large Language Models
- Content Moderation
- Adversarial Attacks
- Typographic Manipulations
- Human Perception
- Online Content Safety
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.