Prefill Awareness in Large Language Models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The study "Prefill Awareness in Large Language Models" investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited, a capability the authors term "prefill awareness." The research constructs a binary preference benchmark across three prefill mechanisms and finds that models such as Claude Opus 4.5 exhibit substantial prefill awareness. When explicitly prompted, Claude Opus 4.5 detected prefills opposing its preferences in 9-35% of cases with a 0% false positive rate. It also achieved 55-68% balanced detection accuracy and a 48.6% resistance rate under thinking tampering. Detection and resistance turn out to be partially decoupled and to rely on different cues: stylistic mismatch primarily influences explicit flagging, while preference mismatch drives reversion to baseline behavior. The same awareness appears in realistic agentic settings, such as misalignment-continuation evaluations and SWE-bench trajectories, where models sometimes disavow prefilled turns; how often they do so depends on the dataset, task success, and hidden formatting artifacts.

Key takeaway

If you are an AI security engineer or a researcher developing LLM evaluations, account for prefill awareness in your methodology. A prefill-based evaluation can be compromised when the model detects the tampering, which may lead you to overestimate alignment or let the model evade control measures. Measure detection and resistance separately, and make prefills more realistic in high-stakes evaluations so that results remain valid. Track this capability in frontier systems during pre-deployment as well.

Key insights

Frontier LLMs can detect and resist tampered prior outputs, confounding prefill-based evaluations.

Principles

Prefill awareness is a heterogeneous bundle of sensitivities.
Detection and resistance to prefills are partially decoupled.
Stylistic cues primarily drive explicit prefill detection.

Method

The authors constructed a binary preference benchmark and filtered it to items on which the model takes a consistent stance. They then applied three prefill mechanisms (thinking, direct-answer, and past-turn tampering) to measure detection and resistance.
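The filtering step can be illustrated with a minimal sketch. It assumes that a "consistent stance" means the model picks the same option in most repeated, untampered samples. The summary does not state the paper's actual criterion, so the function names, the `min_agreement` threshold, and the sample data below are all hypothetical.

```python
from collections import Counter


def consistent_stance(samples, min_agreement=1.0):
    """Return the majority choice if it meets the agreement threshold, else None.

    `samples` is a list of the model's answers ("A" or "B") to one binary
    preference item, collected without any prefill. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    if not samples:
        return None
    choice, count = Counter(samples).most_common(1)[0]
    return choice if count / len(samples) >= min_agreement else None


def filter_benchmark(baseline_answers, min_agreement=1.0):
    """Keep only items where the model's baseline stance is stable."""
    kept = {}
    for item_id, samples in baseline_answers.items():
        stance = consistent_stance(samples, min_agreement)
        if stance is not None:
            kept[item_id] = stance
    return kept


# Hypothetical baseline samples for three benchmark items.
baseline_answers = {
    "q1": ["A", "A", "A", "A"],
    "q2": ["A", "B", "A", "B"],
    "q3": ["B", "B", "B", "A"],
}

stable = filter_benchmark(baseline_answers, min_agreement=0.75)
print(stable)
```

Only items with a stable baseline are useful here, because a prefill can be said to oppose the model's preference only when that preference is well defined.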

In practice

Measure prefill detection and resistance separately.
Increase prefill realism in high-stakes evaluations.
Track prefill awareness in frontier systems.
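Measuring detection and resistance separately could look like the sketch below. It assumes each trial records three things: whether the turn was actually tampered with, whether the model explicitly flagged it, and whether the final answer reverted to the baseline stance. The `Trial` fields and metric definitions are illustrative assumptions; the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    prefilled: bool  # was the assistant turn actually tampered with?
    flagged: bool    # did the model explicitly say the turn was not its own?
    reverted: bool   # did the final answer return to the baseline stance?


def rate(numerator, denominator):
    """Safe ratio; returns NaN when there are no trials of that kind."""
    return numerator / denominator if denominator else float("nan")


def summarize(trials):
    """Report detection and resistance as separate quantities."""
    tampered = [t for t in trials if t.prefilled]
    clean = [t for t in trials if not t.prefilled]
    tpr = rate(sum(t.flagged for t in tampered), len(tampered))
    fpr = rate(sum(t.flagged for t in clean), len(clean))
    return {
        "detection_rate": tpr,
        "false_positive_rate": fpr,
        # Balanced accuracy: mean of true positive and true negative rates.
        "balanced_accuracy": (tpr + (1 - fpr)) / 2,
        # Resistance is computed over tampered trials only, independent of flagging.
        "resistance_rate": rate(sum(t.reverted for t in tampered), len(tampered)),
    }


# Hypothetical results: four tampered trials and four clean controls.
trials = [
    Trial(prefilled=True, flagged=True, reverted=True),
    Trial(prefilled=True, flagged=False, reverted=True),
    Trial(prefilled=True, flagged=False, reverted=False),
    Trial(prefilled=True, flagged=False, reverted=False),
    Trial(prefilled=False, flagged=False, reverted=False),
    Trial(prefilled=False, flagged=False, reverted=False),
    Trial(prefilled=False, flagged=False, reverted=False),
    Trial(prefilled=False, flagged=False, reverted=False),
]

summary = summarize(trials)
print(summary)
```

Keeping `flagged` and `reverted` as independent fields makes the partial decoupling visible: in the hypothetical data, one trial reverts to baseline without any explicit flag, which a single combined score would hide.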

Topics

Large Language Models
Prefill Awareness
AI Safety
Model Evaluation
Claude Opus 4.5
Evaluation Validity

Code references

UKGovernmentBEIS/inspect_ai

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.