Prefill Awareness in Large Language Models

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A study investigates "prefill awareness" in frontier large language models: their ability to distinguish tampered from untampered assistant-side context. The capability matters because safety-relevant studies, including alignment and jailbreaking evaluations, often rely on prefilling model outputs, and a model that recognizes such tampering could undermine the validity of those methods. Using a binary preference benchmark across three prefill mechanisms, the researchers found substantial prefill awareness. When prompted, Claude Opus 4.5, for instance, detected prefills opposing its preferences in 9-35% of cases, with a 0% false positive rate. Models also frequently reverted towards baseline behavior without explicitly reporting that the prefill was foreign. Ablation studies showed that detection and resistance rely on different cues: stylistic mismatch influences whether a prefill is flagged as foreign, while preference mismatch drives reversion to baseline answers. The findings indicate that prefill awareness significantly confounds some prefill-based evaluation methods.

Key takeaway

If you are an AI security engineer or AI scientist running alignment or jailbreaking evaluations, recognize that frontier LLMs exhibit "prefill awareness." Prefill-based testing methods can be compromised when a model detects and reacts to tampered assistant context, for example by reverting to baseline behavior or disavowing the prefilled turn. Track this capability in the models you evaluate, and consider adjusting evaluation protocols to account for their ability to discern and resist prefilled inputs, so that your safety assessments remain valid.

Key insights

Frontier LLMs exhibit "prefill awareness," detecting and reacting to tampered assistant context, which confounds evaluation methods.

Principles

Frontier LLMs can distinguish tampered from untampered assistant-side context.
Detection and resistance rely on distinct cues.
Stylistic mismatch flags foreign prefills.
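
One way to check whether flagging and reversion track different cues is to cross the two cues and compare rates per cell. The sketch below is illustrative only: the 2x2 layout, the record format, and the names CONDITIONS and rates_by_condition are assumptions made for this example, not the study's actual ablation design.

```python
from collections import defaultdict
from itertools import product

# Hypothetical 2x2 grid: (style_mismatch, preference_mismatch).
CONDITIONS = list(product([False, True], repeat=2))


def rates_by_condition(records):
    """Aggregate flag and reversion rates per ablation cell.

    Each record is a tuple (style_mismatch, preference_mismatch, flagged, reverted),
    where `flagged` means the model called the prefill foreign and `reverted`
    means its final answer returned to its baseline stance.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n, flagged count, reverted count]
    for style, pref, flagged, reverted in records:
        cell = totals[(style, pref)]
        cell[0] += 1
        cell[1] += int(flagged)
        cell[2] += int(reverted)
    return {
        key: {"n": n, "flag_rate": f / n, "revert_rate": r / n}
        for key, (n, f, r) in totals.items()
    }
```

If flag_rate moves mainly with the style cue and revert_rate mainly with the preference cue, that pattern would match the dissociation the summary describes.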

Method

Construct a binary preference benchmark across three prefill mechanisms, filtering for consistent stances to assess prefill awareness.
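
A minimal sketch of how the headline metrics could be scored once trials have been collected. The Trial fields and the summarize helper are hypothetical names introduced here; the study's own scoring pipeline is not described in this summary, and no real results are reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    prefilled: bool  # True if the assistant turn was tampered with (assumed label)
    flagged: bool    # True if the model reported the turn as not its own
    reverted: bool   # True if the final answer matched the model's baseline stance


def _rate(count: int, total: int) -> float:
    # Return 0.0 for an empty group instead of dividing by zero.
    return count / total if total else 0.0


def summarize(trials):
    """Compute detection, false-positive, and silent-reversion rates."""
    prefilled = [t for t in trials if t.prefilled]
    clean = [t for t in trials if not t.prefilled]
    return {
        # Share of tampered turns the model flagged as foreign.
        "detection_rate": _rate(sum(t.flagged for t in prefilled), len(prefilled)),
        # Share of untampered turns the model wrongly flagged.
        "false_positive_rate": _rate(sum(t.flagged for t in clean), len(clean)),
        # Share of tampered turns where the model reverted without flagging.
        "silent_reversion_rate": _rate(
            sum(t.reverted and not t.flagged for t in prefilled), len(prefilled)
        ),
    }
```

Keeping silent reversion separate from detection matters because, per the summary, models often drift back to baseline without ever saying the prefill was foreign.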

In practice

Track prefill awareness in frontier LLM systems.
Account for prefill awareness in agentic evaluations.

Topics

Large Language Models
Prefill Awareness
AI Safety
Alignment Evaluation
Jailbreaking
Claude Opus 4.5
SWE-bench

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.