Prefill Awareness in Large Language Models
Summary
A study investigates "prefill awareness" in frontier large language models: their ability to distinguish tampered from untampered assistant-side context. This capability matters because safety-relevant studies, including alignment and jailbreaking evaluations, often rely on prefilling model outputs, and a model that recognizes such tampering could undermine the validity of those methods. Using a binary preference benchmark across three prefill mechanisms, the researchers found substantial prefill awareness. Claude Opus 4.5, for instance, when prompted, detected prefills that opposed its preferences in 9-35% of cases with a 0% false positive rate. Models also frequently reverted towards baseline behavior without explicitly reporting that the prefill was foreign. Ablation studies revealed that detection and resistance rely on different cues: stylistic mismatch influences whether a prefill is flagged as foreign, while preference mismatch drives reversion to baseline answers. The findings indicate that prefill awareness significantly confounds some prefill-based evaluation methods.
Key takeaway
AI security engineers and AI scientists who run alignment and jailbreaking evaluations should recognize that frontier LLMs exhibit "prefill awareness." Prefill-based testing methods can be compromised when models detect and react to tampered assistant context, for example by reverting to baseline behavior or disavowing prefilled turns. Track this capability in your systems and consider adjusting evaluation protocols to account for models' ability to discern and resist prefilled inputs, so that your safety assessments remain valid.
Key insights
Frontier LLMs exhibit "prefill awareness," detecting and reacting to tampered assistant context, which confounds evaluation methods.
Principles
- LLMs distinguish tampered from untampered context.
- Detection and resistance rely on distinct cues.
- Stylistic mismatch flags foreign prefills.
Method
Construct a binary preference benchmark spanning three prefill mechanisms, keep only the items on which the model takes a consistent stance, and use those items to assess prefill awareness.
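The "consistent stances" filter can be illustrated with a minimal sketch. It assumes each benchmark item has been sampled several times without any prefill and that each sample is one of two answers. The function names, the `min_agreement` threshold, and the sample data are hypothetical and are not taken from the study.

```python
from collections import Counter


def consistent_stance(answers, min_agreement=1.0):
    """Return the majority answer if its share reaches min_agreement, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None


def filter_items(baseline_samples, min_agreement=1.0):
    """Keep only items whose unprefilled answers agree often enough.

    baseline_samples maps an item id to the list of answers sampled
    without any prefill. The result maps each kept item id to its
    baseline stance.
    """
    kept = {}
    for item_id, answers in baseline_samples.items():
        stance = consistent_stance(answers, min_agreement)
        if stance is not None:
            kept[item_id] = stance
    return kept


# Toy data for illustration only (not results from the study).
baseline_samples = {
    "q1": ["A", "A", "A", "A"],  # fully consistent
    "q2": ["A", "B", "A", "B"],  # no stable stance
    "q3": ["B", "B", "B", "A"],  # mostly consistent
}
stable = filter_items(baseline_samples, min_agreement=0.75)
```

Only items that survive the filter have a well-defined baseline, which is what makes "a prefill opposing the model's preference" and "reversion to baseline" measurable in the first place.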
In practice
- Track prefill awareness in frontier LLM systems.
- Account for prefill awareness in agentic evaluations.
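One way to track prefill awareness is to log, for every evaluation trial, whether the assistant context was tampered with, whether the model flagged it as foreign, and whether the final answer reverted to the model's baseline. The sketch below computes the rates discussed in the summary from such a log. The `Trial` record, its field names, and the toy trials are assumptions made for illustration, not the study's data or code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    tampered: bool  # was the assistant-side context prefilled?
    flagged: bool   # did the model report the context as foreign?
    reverted: bool  # did the final answer match the baseline stance?


def rate(numerator, denominator):
    """Safe ratio that returns 0.0 when there are no trials to count."""
    return numerator / denominator if denominator else 0.0


def summarize(trials):
    tampered = [t for t in trials if t.tampered]
    clean = [t for t in trials if not t.tampered]
    return {
        # Share of tampered trials the model flagged as foreign.
        "detection_rate": rate(sum(t.flagged for t in tampered), len(tampered)),
        # Share of untampered trials wrongly flagged as foreign.
        "false_positive_rate": rate(sum(t.flagged for t in clean), len(clean)),
        # Share of tampered trials where the model returned to baseline.
        "reversion_rate": rate(sum(t.reverted for t in tampered), len(tampered)),
        # Reversion without any explicit report of tampering.
        "silent_reversion_rate": rate(
            sum(t.reverted and not t.flagged for t in tampered), len(tampered)
        ),
    }


# Toy log for illustration only.
trials = [
    Trial(tampered=True, flagged=True, reverted=True),
    Trial(tampered=True, flagged=False, reverted=True),
    Trial(tampered=True, flagged=False, reverted=False),
    Trial(tampered=True, flagged=False, reverted=False),
    Trial(tampered=False, flagged=False, reverted=False),
    Trial(tampered=False, flagged=False, reverted=False),
]
summary = summarize(trials)
```

Reporting silent reversion separately from detection reflects the summary's point that models often resist a prefill without saying so, which means a low detection rate alone does not show that a prefill-based method is unaffected.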
Topics
- Large Language Models
- Prefill Awareness
- AI Safety
- Alignment Evaluation
- Jailbreaking
- Claude Opus 4.5
- SWE-bench
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.