Prompt Injection as Role Confusion

2026-06-22 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Intermediate, quick

Summary

Research by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell reveals that large language models struggle to differentiate their own privileged internal text, often wrapped in role tags like "<system>" or "<thought>", from untrusted user input within "<user>" tags. A concerning finding is that models prioritize the *style* of text over its actual content. For instance, models like `gpt-oss-20b` can be jailbroken when user input mimics the style of internal thinking blocks, overriding their initial training. The researchers demonstrated that "destyling"—rewriting text to appear less like expected role tag formats—significantly reduced attack success rates from 61% to 10%. This underlying issue, termed "role confusion," is identified as a major hurdle in developing robust prompt injection defenses, suggesting that without genuine role perception, such defenses will remain reactive.

Key takeaway

For AI Security Engineers designing LLM applications, you must recognize that current models are highly susceptible to "role confusion" based on input style. Your prompt engineering and input validation strategies should account for this by actively destyling user inputs to prevent them from mimicking internal model thought processes. Failing to address this stylistic vulnerability will leave your systems open to persistent prompt injection attacks, requiring continuous reactive patching.

Key insights

LLMs exhibit "role confusion," prioritizing text style over content, making them vulnerable to prompt injection attacks that mimic internal thought formats.

Principles

LLMs prioritize text style over content.
Role perception is key for injection defense.
Destyling input reduces attack success.

Method

"Destyling" involves rewriting user input to alter its stylistic resemblance to an LLM's internal thinking blocks, thereby reducing the model's misclassification of the text's role.

In practice

Rewrite user prompts to avoid internal style.
Implement style-aware input sanitization.

Topics

Prompt Injection
Role Confusion
LLM Security
Jailbreaking
Destyling
GPT Models

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.