A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice
Summary
This paper introduces the first formal framework for measuring appropriate reliance on set-valued AI advice, filling a gap left by existing reliance measures, which consider only point predictions. Set-valued advice, such as discrete prediction sets for classification or continuous intervals for regression, is increasingly used to communicate AI uncertainty. The framework operates within the sequential judge-advisor paradigm and defines distinct metrics for each task type. For classification, it uses Correct Reliance Rate on AI (CRR_AI) and Correct Reliance Rate on Self (CRR_self) to jointly characterize appropriate reliance. For regression, it introduces Quantity of AI Reliance (AIR_quant) and Quality of AI Reliance (AIR_qual), which measure how much a decision maker uses AI advice and whether that use improves the decision relative to the ground truth. The resulting diagnostics separate failure modes such as automation bias and algorithm aversion, which traditional accuracy or Weight of Advice (WoA) metrics cannot distinguish, and so can inform system design and the evaluation of interventions.
Key takeaway
If you are an AI scientist or practitioner evaluating human-AI collaboration, move beyond traditional accuracy or Weight of Advice (WoA) metrics. Incorporate the proposed CRR_AI, CRR_self, AIR_quant, and AIR_qual metrics so that you can diagnose specific reliance failure modes such as automation bias or algorithm aversion. This lets you design targeted interventions and systems that foster appropriate reliance on set-valued AI advice, supporting genuine human oversight and reducing the risk of unintended harms.
Key insights
Measuring appropriate reliance on set-valued AI advice requires dedicated metrics that separate the quantity of reliance from its quality.
Principles
- AI reliance metrics must condition on ground truth.
- Accuracy alone conflates beneficial and detrimental reliance.
- Set-valued advice needs specific reliance measures.
Method
For classification, the framework defines CRR_AI and CRR_self based on AI informativeness. For regression, it uses AIR_quant (behavioral adjustment) and AIR_qual (error improvement) to assess reliance on interval midpoints.
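To make the four metrics concrete, here is a minimal Python sketch of one plausible reading of them. This summary does not give the paper's formulas, so everything below is an assumption for illustration: the prediction set is treated as "informative" when it contains the true label, CRR_AI and CRR_self are computed as conditional rates in the style of earlier point-prediction reliance measures, AIR_quant is computed as a Weight-of-Advice-style shift toward the interval midpoint, and AIR_qual as the reduction in absolute error. The class and function names are hypothetical, and the paper's actual definitions may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class ClassificationTrial:
    truth: str
    initial: str       # decision before seeing the AI advice
    advice: Set[str]   # AI prediction set
    final: str         # decision after seeing the AI advice


@dataclass
class RegressionTrial:
    truth: float
    initial: float     # estimate before seeing the AI advice
    lower: float       # lower bound of the AI interval
    upper: float       # upper bound of the AI interval
    final: float       # estimate after seeing the AI advice


def _rate(hits: int, total: int) -> Optional[float]:
    # None signals that no trial met the conditioning event.
    return hits / total if total else None


def crr_ai(trials: Sequence[ClassificationTrial]) -> Optional[float]:
    # ASSUMPTION: condition on trials where the human starts wrong and the
    # prediction set contains the truth; count how often the final is correct.
    eligible = [t for t in trials if t.initial != t.truth and t.truth in t.advice]
    return _rate(sum(t.final == t.truth for t in eligible), len(eligible))


def crr_self(trials: Sequence[ClassificationTrial]) -> Optional[float]:
    # ASSUMPTION: condition on trials where the human starts right and the
    # prediction set misses the truth; count how often the human stays correct.
    eligible = [t for t in trials if t.initial == t.truth and t.truth not in t.advice]
    return _rate(sum(t.final == t.truth for t in eligible), len(eligible))


def air_quant(t: RegressionTrial) -> Optional[float]:
    # ASSUMPTION: WoA-style shift from the initial estimate toward the midpoint.
    midpoint = (t.lower + t.upper) / 2
    if midpoint == t.initial:
        return None  # undefined when the advice equals the initial estimate
    return (t.final - t.initial) / (midpoint - t.initial)


def air_qual(t: RegressionTrial) -> float:
    # ASSUMPTION: positive when the final estimate is closer to the truth.
    return abs(t.initial - t.truth) - abs(t.final - t.truth)


# Toy data, invented purely for illustration.
classification_trials = [
    ClassificationTrial("cat", "dog", {"cat", "fox"}, "cat"),  # adopts helpful advice
    ClassificationTrial("cat", "dog", {"cat"}, "dog"),         # ignores helpful advice
    ClassificationTrial("cat", "cat", {"dog"}, "cat"),         # resists misleading advice
    ClassificationTrial("cat", "cat", {"dog", "fox"}, "dog"),  # follows misleading advice
]
helped = RegressionTrial(truth=100.0, initial=80.0, lower=90.0, upper=110.0, final=95.0)
harmed = RegressionTrial(truth=100.0, initial=100.0, lower=120.0, upper=140.0, final=115.0)

print(crr_ai(classification_trials), crr_self(classification_trials))
print(air_quant(helped), air_qual(helped))
print(air_quant(harmed), air_qual(harmed))
```

On this toy data both classification rates are 0.5. The two regression trials show why quantity and quality are reported separately: the decision maker moves a comparable fraction of the way toward the midpoint in each (0.75 and 0.5), yet the error shrinks by 15 in the first trial and grows by 15 in the second.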
In practice
- Diagnose automation bias vs. algorithm aversion.
- Evaluate interventions beyond accuracy or WoA.
- Design systems for verifiability.
Topics
- Human-AI Collaboration
- Set-Valued AI Advice
- Appropriate Reliance Metrics
- Classification Tasks
- Regression Tasks
- Automation Bias
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.