What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Human-Perceptible Adversarial Attacks (HPAA) are a new class of attacks that exploit a perceptual mismatch between human readers and LLM-powered content moderation systems. They embed harmful expressions in otherwise benign text using visually salient typographic manipulations such as strategic spacing, visual emphasis, and spatial arrangement. Humans readily recognize the harmful content, but automated moderation systems, which operate primarily on tokenized text and ignore visual cues, largely miss it. The attack works in black-box settings with a small query budget: it generates evasive content automatically, without model access or gradient information. Across multiple datasets and ten deployed moderation systems, including commercial APIs, the generated attacks achieved over 86% human recognition while keeping detection rates below 1% with just three detector queries. The research exposes a significant blind spot in current LLM-based moderation and highlights the need for systems that align more closely with human perceptual understanding.

Key takeaway

AI security engineers evaluating LLM-based content moderation should recognize that current token-based systems largely miss visually manipulated harmful content. Deployed guardrails, including commercial APIs and open-source solutions, are highly susceptible to Human-Perceptible Adversarial Attacks: in the reported evaluation, detection stayed below 1% on content that achieved over 86% human recognition. Integrate visual reasoning capabilities into your moderation pipelines, or accept that a significant share of harmful content can evade them.

Key insights

LLM-based content moderation systems are vulnerable to adversarial attacks that rely on visual cues which humans perceive but token-based models miss.

Principles

Typographic features can bypass token-based moderation.
Human perception relies on visual cues LLMs miss.
Black-box attacks are effective with minimal queries.

Method

The attack automatically generates evasive content by strategically combining typographic features (spacing, emphasis, arrangement) to embed harmful expressions, operating in black-box settings with a small query budget.
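
The mismatch behind this method can be illustrated with a deliberately harmless toy. The sketch below is not the paper's attack and does not model a real LLM: a whitespace tokenizer and a one-word blocklist stand in for a token-based moderation system, and PLACEHOLDER, tokenize, token_filter_flags, and space_out are hypothetical names introduced only for this illustration. It shows how a single spacing manipulation leaves a word readable to a person while no token matches it any longer.

```python
# Toy illustration only (assumption): a whitespace tokenizer and a blocklist
# stand in for a token-based moderation model. PLACEHOLDER is a harmless
# stand-in term, not real harmful content.
PLACEHOLDER = "forbidden"
BLOCKLIST = {PLACEHOLDER}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def token_filter_flags(text: str) -> bool:
    """Return True if any token is on the blocklist."""
    return any(tok in BLOCKLIST for tok in tokenize(text))


def space_out(word: str) -> str:
    """Insert a space between characters, one simple spacing manipulation."""
    return " ".join(word)


plain = f"this is {PLACEHOLDER} content"
spaced = f"this is {space_out(PLACEHOLDER)} content"

print(token_filter_flags(plain))   # the unmodified word is a single token
print(token_filter_flags(spaced))  # the spaced word becomes single-character tokens
```

Production moderation models use subword tokenizers and learned classifiers rather than a blocklist, so this only conveys the intuition; the summary does not describe the specific transformations or search procedure the attack uses.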

In practice

Test moderation systems against visual typographic attacks.
Integrate visual processing into LLM moderation pipelines.
Prioritize human-aligned content understanding.
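
One way to start on the first item is a small regression check that measures how often a detector flags the same content before and after typographic normalization. This is a sketch under stated assumptions, not a method from the source: normalize and regression_gap are hypothetical helpers, the detector is any callable you supply, and the demo detector and samples use a harmless placeholder word. Text normalization also covers only spacing and simple emphasis markers; it does not replace the visual reasoning the article calls for, and it cannot undo spatial arrangement.

```python
import re
import unicodedata
from typing import Callable

# Runs of three or more single characters separated by single spaces,
# for example "f o r b i d d e n".
_SPACED_RUN = re.compile(r"\b(?:\w ){2,}\w\b")
# Plain-text emphasis markers. Stripping underscores is aggressive and would
# also alter identifiers such as snake_case names (accepted in this sketch).
_EMPHASIS = re.compile(r"[*_~]+")


def normalize(text: str) -> str:
    """Fold compatibility characters, drop emphasis markers, rejoin spaced runs."""
    text = unicodedata.normalize("NFKC", text)
    text = _EMPHASIS.sub("", text)
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def regression_gap(samples: list[str],
                   detector: Callable[[str], bool]) -> dict[str, float]:
    """Compare detection rates on raw text and on normalized text."""
    if not samples:
        raise ValueError("samples must not be empty")
    raw_hits = sum(detector(s) for s in samples)
    norm_hits = sum(detector(normalize(s)) for s in samples)
    return {
        "raw_rate": raw_hits / len(samples),
        "normalized_rate": norm_hits / len(samples),
    }


# Demo with a stand-in detector (assumption): flags a harmless placeholder token.
def demo_detector(text: str) -> bool:
    return "forbidden" in text.lower().split()


demo_samples = [
    "this is f o r b i d d e n content",
    "this is **forbidden** content",
    "this is forbidden content",
]
report = regression_gap(demo_samples, demo_detector)
print(report)
```

A large gap between raw_rate and normalized_rate suggests the detector depends on surface token form. In a real evaluation, demo_detector would be replaced by a wrapper around the moderation system under test.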

Topics

Large Language Models
Content Moderation
Adversarial Attacks
Typographic Manipulations
Human Perception
Online Content Safety

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.