What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new class of Human-Perceptible Adversarial Attacks (HPAA) exploits a fundamental perceptual mismatch between human interpretation and LLM-powered content moderation systems. These attacks embed harmful expressions into otherwise benign text using visually salient typographic manipulations, such as strategic spacing, visual emphasis, and spatial arrangement. While humans readily recognize the harmful content, automated moderation systems, which primarily operate on tokenized text and ignore visual cues, largely miss it. Operating in black-box settings with only a small query budget, HPAA automatically generates evasive content without requiring model access or gradient information. Evaluations across multiple datasets and ten deployed moderation systems, including commercial APIs, revealed that generated attacks achieve over 86% human recognition while maintaining detection rates below 1% with just three detector queries. This research exposes a significant blind spot in current LLM-based moderation and highlights the need for systems that align more closely with human perceptual understanding.

Key takeaway

For AI security engineers evaluating LLM-based content moderation: current token-based systems are largely blind to visually manipulated harmful content. Deployed guardrails, including commercial APIs and open-source solutions, are highly susceptible to Human-Perceptible Adversarial Attacks. In the reported evaluation, detection rates stayed below 1% on content that achieved over 86% human recognition. Integrate visual reasoning capabilities into your moderation pipelines, or risk harmful content evading them at scale.

Key insights

LLM content moderation systems are vulnerable to visually-based adversarial attacks that exploit human perception.

Principles

Method

The attack automatically generates evasive content by strategically combining typographic features (spacing, emphasis, arrangement) to embed harmful expressions, operating in black-box settings with a small query budget.
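The perceptual mismatch behind this method can be illustrated from the defender's side with a small sketch. It is not the paper's attack or evaluation pipeline. It assumes a toy blocklist detector (`toy_detector`) with a harmless placeholder term, and a hypothetical normalization helper (`collapse_spaced_letters`) that rejoins letters separated by spaces or line breaks. Real LLM moderators use subword tokenizers rather than blocklists, and this heuristic covers only the spacing and simple vertical-arrangement cases, not visual emphasis.

```python
import re

# Hypothetical blocklist; "forbidden" is a harmless stand-in for a harmful term.
BLOCKLIST = {"forbidden"}


def toy_detector(text: str) -> bool:
    """Flag text if any alphabetic token appears on the blocklist.

    Assumption: a stand-in for a token-based moderator, not a real system.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)


# Runs of three or more single letters separated by whitespace
# (spaces or newlines), e.g. "f o r b i d d e n".
_SPACED_RUN = re.compile(r"\b(?:[A-Za-z]\s+){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Rejoin letter-by-letter runs so a token-based check can see the word.

    Heuristic only: it can wrongly merge legitimate sequences such as
    "a b c", so a real pipeline would need a more careful rule.
    """
    return _SPACED_RUN.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


if __name__ == "__main__":
    spaced = "this is f o r b i d d e n here"
    print(toy_detector(spaced))                           # raw text
    print(collapse_spaced_letters(spaced))                # normalized text
    print(toy_detector(collapse_spaced_letters(spaced)))  # after normalization
```

A human reads the spaced-out word immediately, while the token-level check sees only isolated letters until the text is normalized. Normalization of this kind is a narrow mitigation; it does not replace the visual reasoning capability the source argues moderation systems need.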

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.