MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

2026-04-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The MIRROR benchmark comprises eight experiments across four metacognitive levels and evaluates 16 large language models (LLMs) from 8 labs on approximately 250,000 evaluation instances. The study finds that compositional self-prediction fails in every model tested: the Compositional Calibration Error (CCE) ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set and from 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion. Models also show above-chance but imperfect domain-specific self-knowledge, yet fail to translate that awareness into appropriate agentic action-selection. External metacognitive control reduces the Confident Failure Rate (CFR) from 0.600 to 0.143 (a 76% reduction at temperature 0; a mean 70% reduction at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores yields no significant improvement (p > 0.05); only architectural constraint proves effective, which suggests that external scaffolding is crucial for safer autonomous AI systems.
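The 76% figure is a relative reduction, and it can be checked directly from the two CFR values quoted in the summary. The short sketch below does only that arithmetic; the function name is illustrative and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.25 means a 25% reduction)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# CFR values quoted in the summary (temperature 0).
cfr_without_control = 0.600
cfr_with_external_control = 0.143

reduction = relative_reduction(cfr_without_control, cfr_with_external_control)
print(f"{reduction:.1%}")  # 76.2%
```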

Key takeaway

For AI architects designing autonomous LLM agents, a model's internal self-assessment is not a sufficient safety mechanism. Prioritize external metacognitive scaffolding, such as architectural constraints or external routing systems based on pre-deployment calibration profiles, to enforce appropriate action-selection. In testing, this approach reduced the Confident Failure Rate by 76% at temperature 0, and it proved more effective than trying to improve the model's intrinsic self-knowledge or giving the model its own calibration scores.

Key insights

LLMs possess partial self-knowledge but universally fail to translate it into appropriate agentic action without external architectural control.

Principles

Compositional self-prediction universally fails in LLMs.
Self-knowledge in LLMs is domain-atomic and does not transfer across domains.
External architectural constraint is critical for LLM agentic safety.

Method

MIRROR evaluates LLM metacognition at four levels (atomic self-knowledge, cross-domain transfer, compositional prediction, and adaptive self-regulation) using five independent behavioral measurement channels (wagering, opt-out, difficulty selection, tool delegation, and natural language signals).

In practice

Implement external routing systems for LLM agentic tasks.
Do not rely solely on LLM self-reported uncertainty for safety.
Prefer architectural interventions over exposing models to their own calibration scores.
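To make the first recommendation concrete, here is a minimal sketch of an external router driven by a pre-deployment calibration profile. It is not the paper's implementation: the class and function names, the thresholds (0.8 and 0.5), and the example accuracies are all hypothetical. The router never consults the model's own confidence; it looks only at accuracy measured offline, and for multi-domain tasks it uses the weakest domain, a conservative choice made here because the summary reports that compositional self-prediction is unreliable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Action(Enum):
    PROCEED = "proceed"    # let the model act autonomously
    USE_TOOL = "use_tool"  # delegate the step to an external tool
    DEFER = "defer"        # escalate to a human or decline


@dataclass(frozen=True)
class CalibrationProfile:
    """Per-domain accuracy measured offline, before deployment (hypothetical)."""
    accuracy_by_domain: Dict[str, float]

    def accuracy(self, domain: str) -> float:
        # Unprofiled domains count as worst case so they never auto-proceed.
        return self.accuracy_by_domain.get(domain, 0.0)


def route(profile: CalibrationProfile, domains: Iterable[str],
          proceed_at: float = 0.8, tool_at: float = 0.5) -> Action:
    """Pick an action from measured accuracy alone, ignoring self-reports."""
    scores = [profile.accuracy(d) for d in domains]
    if not scores:
        return Action.DEFER
    weakest = min(scores)  # a multi-domain task is gated by its weakest domain
    if weakest >= proceed_at:
        return Action.PROCEED
    if weakest >= tool_at:
        return Action.USE_TOOL
    return Action.DEFER


# Illustrative numbers only; not results from the benchmark.
profile = CalibrationProfile({"arithmetic": 0.92, "law": 0.61, "chemistry": 0.35})
print(route(profile, ["arithmetic"]))         # Action.PROCEED
print(route(profile, ["arithmetic", "law"]))  # Action.USE_TOOL
print(route(profile, ["chemistry"]))          # Action.DEFER
print(route(profile, ["astronomy"]))          # Action.DEFER
```

Because the decision sits outside the model, it holds regardless of what the model says about its own uncertainty, which is the property the recommendations above depend on.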

Topics

MIRROR Benchmark
Metacognitive Calibration
Large Language Models
Agentic AI Systems
Compositional Self-Prediction

Best for: Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.