Prefill Awareness in Large Language Models
Summary
A study titled "Prefill Awareness in Large Language Models" investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited, a capability termed "prefill awareness." The research constructs a binary preference benchmark across three prefill mechanisms, revealing that models like Claude Opus 4.5 exhibit substantial prefill awareness, detecting prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when explicitly prompted. Claude Opus 4.5 achieved 55-68% balanced detection accuracy and 48.6% resistance rate under thinking tampering. The study found that detection and resistance are partially decoupled, relying on different cues: stylistic mismatch primarily influences explicit flagging, while preference mismatch drives reversion to baseline behavior. This awareness also manifests in realistic agentic settings, such as misalignment-continuation evaluations and SWE-bench trajectories, where models sometimes disavow prefilled turns, influenced by dataset, task success, and hidden formatting artifacts.
Key takeaway
For AI Security Engineers and researchers developing LLM evaluations, you must account for prefill awareness in your methodologies. Your prefill-based evaluations could be compromised if models detect tampering, potentially leading to an overestimation of alignment or evasion of control measures. Measure detection and resistance separately, and take steps to increase the realism of prefills in high-stakes evaluations to ensure valid results. You should also track this capability in frontier systems during pre-deployment.
Key insights
Frontier LLMs can detect and resist tampered prior outputs, confounding prefill-based evaluations.
Principles
- Prefill awareness is a heterogeneous bundle of sensitivities.
- Detection and resistance to prefills are partially decoupled.
- Stylistic cues primarily drive explicit prefill detection.
Method
A binary preference benchmark was constructed, filtering for consistent model stances. Three prefill mechanisms (thinking, direct-answer, past-turn tampering) were used to measure detection and resistance.
In practice
- Measure prefill detection and resistance separately.
- Increase prefill realism in high-stakes evaluations.
- Track prefill awareness in frontier systems.
Topics
- Large Language Models
- Prefill Awareness
- AI Safety
- Model Evaluation
- Claude Opus 4.5
- Evaluation Validity
Code references
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.