Decoy-Calibrated Failure Audits for Language Models

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Janus is a procedure for producing credible failure audits of language models by correcting for selection bias in the search for error explanations. It scores candidate explanations, called descriptors (e.g., "long inputs" or "indirect questions"), by their error-rate lift, then compares them against "decoy" descriptors that have the same frequencies but are assigned to examples at random. A descriptor is confirmed as a failure mode only if it exceeds this decoy floor on the discovery data and its effect replicates on separate held-out data. In controlled multi-table lookup tasks, Janus recovered planted long-chain failure modes in which LLMs halted prematurely. On the public benchmarks MuSiQue and LongBench v2, the SliceLine baseline flagged candidate high-error regions, but Janus confirmed none, which reflects how strict its validation is. Ablations on LongBench v2 show that both safeguards are needed: an uncalibrated threshold reported 20 descriptors, the decoy floor reduced these to one, and the holdout check rejected that one as well, its lift shrinking from 0.36 to 0.05.

Key takeaway

If you are a machine learning engineer auditing language model failures, validate every candidate error explanation before you report it. Use a two-stage approach like Janus: first calibrate candidate failure modes against decoy explanations, then confirm the survivors on separate held-out data. Findings that pass both stages are more likely to be genuine and to transfer to new data, so you spend less effort chasing issues that do not replicate and your model diagnostics become more reliable.

Key insights

Janus validates candidate language model failure modes in two stages, decoy calibration followed by holdout replication, which mitigates selection bias.

Principles

Audit findings must beat decoys and replicate.
Decoy calibration prevents false positives.
Holdout validation ensures generalizability.

Method

Janus scores candidate descriptors by error-rate lift, comparing them to frequency-matched random decoys. Confirmation requires beating the decoy floor on discovery data and replicating on held-out data.
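The two checks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Janus implementation: lift is assumed to be the error rate inside a slice minus the overall error rate, the decoy floor is assumed to be a high quantile of the largest decoy lift across shuffling rounds, and the holdout criterion is a hypothetical fixed minimum lift. The function names, `n_rounds`, `quantile`, `min_holdout_lift`, and the synthetic data are all illustrative choices that do not come from the source.

```python
import numpy as np


def lift(errors, mask):
    """Error rate inside the slice minus the overall error rate (assumed definition)."""
    if mask.sum() == 0:
        return 0.0
    return float(errors[mask].mean() - errors.mean())


def decoy_floor(errors, masks, n_rounds=200, quantile=0.95, seed=0):
    """Estimate the lift that frequency-matched random descriptors reach by chance."""
    rng = np.random.default_rng(seed)
    max_lifts = []
    for _ in range(n_rounds):
        # Shuffling a mask keeps the descriptor's frequency but assigns it
        # to random examples, which is what makes it a decoy.
        decoy_lifts = [lift(errors, rng.permutation(m)) for m in masks.values()]
        max_lifts.append(max(decoy_lifts))
    # Hypothetical choice: the floor is a high quantile of the best decoy lift.
    return float(np.quantile(max_lifts, quantile))


def audit(disc_errors, disc_masks, hold_errors, hold_masks, min_holdout_lift=0.1):
    """Return descriptors that beat the decoy floor and replicate on held-out data."""
    floor = decoy_floor(disc_errors, disc_masks)
    confirmed = []
    for name, mask in disc_masks.items():
        if lift(disc_errors, mask) <= floor:
            continue  # no better than a random descriptor of the same frequency
        # Hypothetical replication rule: held-out lift must stay above a minimum.
        if lift(hold_errors, hold_masks[name]) >= min_holdout_lift:
            confirmed.append(name)
    return confirmed


def make_split(n, seed):
    """Synthetic data: errors are planted on 'long_input' only."""
    rng = np.random.default_rng(seed)
    masks = {
        "long_input": rng.random(n) < 0.2,
        "indirect_question": rng.random(n) < 0.2,
    }
    p_err = np.where(masks["long_input"], 0.7, 0.2)
    errors = rng.random(n) < p_err
    return errors, masks


if __name__ == "__main__":
    disc_errors, disc_masks = make_split(2000, seed=1)
    hold_errors, hold_masks = make_split(2000, seed=2)
    print(audit(disc_errors, disc_masks, hold_errors, hold_masks))
```

On this synthetic data the planted descriptor should be the only one confirmed. Taking the maximum over all decoys in each round is one way to account for the number of descriptors searched, which is where selection bias enters. A real audit would need to choose the floor and the replication rule to match its own error tolerance.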

In practice

Rigorously identify LLM failure modes.
Validate error explanations using decoys.
Confirm findings on separate held-out data.

Topics

Language Model Auditing
LLM Failure Modes
Decoy Calibration
Selection Bias
Model Evaluation
Janus Procedure

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.