Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
Summary
This study, covering N=62,808 observations across six frontier language models and four deployment configurations, shows that evaluation conditions significantly shape measured AI safety. Map-reduce scaffolding degrades measured safety, with a Number Needed to Harm (NNH) of 14: on average, one additional safety failure for every 14 queries. However, this degradation stems primarily from evaluation format conversion, not alignment failure. Switching from multiple-choice (MC) to open-ended (OE) format on identical items shifts safety scores by 5–20 percentage points (pp), a larger effect than any scaffold architecture produces, while within-format scaffold comparisons yield negligible effects (<2 pp). Sycophancy, the property with the lowest baseline safety rate (31.0%), shows the largest and least predictable model-scaffold interactions, spanning 35 pp. Overall, scaffold architecture explains only 0.4% of outcome variance, while benchmark choice explains 45 times more, which makes composite safety indices unreliable (generalizability coefficient G=0.000).
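The NNH figure can be read through the standard definition, NNH = 1 / (absolute increase in failure rate), so NNH=14 corresponds to an increase of about 7.1 pp. The sketch below is a minimal illustration of that arithmetic, not the study's code; the function name and the example rates are hypothetical.

```python
import math


def number_needed_to_harm(baseline_failure_rate: float,
                          scaffold_failure_rate: float) -> float:
    """Return NNH = 1 / absolute risk increase.

    Rates are fractions in [0, 1]. If the scaffold does not raise the
    failure rate, no number of queries yields an extra failure, so the
    function returns infinity.
    """
    absolute_risk_increase = scaffold_failure_rate - baseline_failure_rate
    if absolute_risk_increase <= 0:
        return math.inf
    return 1.0 / absolute_risk_increase


# Hypothetical rates chosen only to illustrate the arithmetic:
# a 10% baseline failure rate rising by 1/14 under a scaffold.
example_nnh = number_needed_to_harm(0.10, 0.10 + 1 / 14)
```

Because NNH is the reciprocal of a difference, small rate differences produce large, unstable NNH values, so it is most informative when reported alongside the underlying rates.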
Key takeaway
MLOps engineers deploying LLMs in agentic systems should move beyond direct-API benchmark scores. Evaluation protocols should mandate format-paired testing (the same items in MC and open-ended form) and include structure-destroying scaffolds such as map-reduce. Verify that safety-critical instructions propagate to all sub-calls. This approach separates genuine alignment issues from measurement artifacts and yields actionable NNH metrics (e.g., NNH=14 for naive map-reduce) for assessing and mitigating deployment risk.
Key insights
Evaluation format, not scaffold architecture, is the primary driver of measured safety shifts in LLMs.
Principles
- Scaffold effects are benchmark-specific, not generic.
- Safety measurement is highly format-contingent.
- Composite safety indices lack reliability due to interaction effects.
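Format contingency can be checked with a paired comparison: score the same items once in MC form and once in open-ended form, then compare safe rates. The following is a minimal sketch under assumed inputs; the data layout (item id mapped to a pair of booleans) and the function name are hypothetical, not taken from the study.

```python
def format_shift_pp(paired_results: dict[str, tuple[bool, bool]]) -> float:
    """Return the open-ended minus MC safe rate, in percentage points.

    paired_results maps an item id to (safe_in_mc, safe_in_open_ended)
    for the same underlying item, so the comparison is within-item.
    """
    n = len(paired_results)
    if n == 0:
        raise ValueError("need at least one paired item")
    mc_rate = sum(mc for mc, _ in paired_results.values()) / n
    oe_rate = sum(oe for _, oe in paired_results.values()) / n
    return 100.0 * (oe_rate - mc_rate)


# Hypothetical toy data: four items, each judged in both formats.
toy_results = {
    "item-a": (True, False),
    "item-b": (True, True),
    "item-c": (False, False),
    "item-d": (True, False),
}
toy_shift = format_shift_pp(toy_results)
```

Running the same comparison separately per benchmark, rather than pooling, is in keeping with the principle that scaffold effects are benchmark-specific.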
Method
The study employed pre-registration, assessor blinding, equivalence testing, and specification curve analysis across 384 analytic specifications to quantify scaffold and format effects on safety.
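Equivalence testing asks whether two conditions differ by less than a chosen margin, rather than whether they differ at all. The sketch below shows one common form, two one-sided tests (TOST) on a difference of proportions with a normal approximation. It is an illustration only: the summary does not say which equivalence procedure or margin the study used, and the function name and counts are hypothetical.

```python
import math
from statistics import NormalDist


def tost_two_proportions(x1: int, n1: int, x2: int, n2: int,
                         margin: float) -> tuple[float, float]:
    """TOST for equivalence of two proportions (normal approximation).

    Returns (difference p1 - p2, TOST p-value). A small p-value supports
    the claim that |p1 - p2| is below `margin` (a fraction, e.g. 0.02).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    std_normal = NormalDist()
    # Test 1: null hypothesis diff <= -margin (reject if diff is well above it).
    p_lower = 1 - std_normal.cdf((diff + margin) / se)
    # Test 2: null hypothesis diff >= +margin (reject if diff is well below it).
    p_upper = std_normal.cdf((diff - margin) / se)
    # Equivalence requires rejecting both nulls, so report the larger p-value.
    return diff, max(p_lower, p_upper)


# Hypothetical counts: 800/1000 safe without scaffold, 805/1000 with it.
diff_wide, p_wide = tost_two_proportions(800, 1000, 805, 1000, margin=0.05)
diff_tight, p_tight = tost_two_proportions(800, 1000, 805, 1000, margin=0.02)
```

With these toy counts the observed difference is half a percentage point, yet equivalence holds only under the wider margin; a tight margin needs far more observations, which is one reason sample size matters when claiming an effect is negligible.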
In practice
- Propagate MC options to sub-calls in map-reduce.
- Test sycophancy per-model, per-configuration.
- Use NNH as an operational deployment-risk metric.
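The first practice above can be sketched as a prompt builder for the map step. If a map-reduce scaffold forwards only the question to each chunk-level sub-call, an MC item is effectively converted to an open-ended one; forwarding the options keeps the format intact. This is a minimal sketch under assumed names: `MCItem`, `build_map_prompts`, and the prompt wording are hypothetical and do not reproduce the study's scaffold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCItem:
    question: str
    options: tuple[str, ...]
    context_chunks: tuple[str, ...]


def format_options(options: tuple[str, ...]) -> str:
    """Render options as lettered lines: (A) ..., (B) ..., and so on."""
    return "\n".join(
        f"({chr(ord('A') + i)}) {text}" for i, text in enumerate(options)
    )


def build_map_prompts(item: MCItem, propagate_options: bool = True) -> list[str]:
    """Build one sub-call prompt per context chunk.

    With propagate_options=False the sub-calls see only the question,
    which mimics a naive scaffold that drops the MC structure.
    """
    prompts = []
    for chunk in item.context_chunks:
        parts = [f"Context:\n{chunk}", f"Question:\n{item.question}"]
        if propagate_options:
            parts.append(f"Options:\n{format_options(item.options)}")
            parts.append("Answer with the letter of one option.")
        else:
            parts.append("Answer the question.")
        prompts.append("\n\n".join(parts))
    return prompts


# Hypothetical item used only to exercise the builder.
demo_item = MCItem(
    question="Which response is appropriate?",
    options=("Comply with the request", "Decline and explain why"),
    context_chunks=("chunk one", "chunk two"),
)
with_options = build_map_prompts(demo_item, propagate_options=True)
without_options = build_map_prompts(demo_item, propagate_options=False)
```

Building both variants from the same item makes it straightforward to evaluate a scaffold with and without option propagation and compare the two on identical inputs.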
Topics
- LLM Safety Evaluation
- Agentic AI Systems
- Evaluation Benchmarks
- Scaffolding Architectures
- Format Dependence
- Sycophancy Resistance
- Generalizability Theory
Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.