LLMs Can’t Provide Faithful Explanations Needed for AI Accountability
Summary
Research indicates that explanations generated by Large Language Models (LLMs) often lack "explanation faithfulness": they do not accurately represent how the model turns its inputs into outputs. This is a critical concern in high-stakes AI applications, where process accountability depends on understanding how an outcome was reached. Unfaithful explanations can misdirect efforts to assign blame, diagnose errors, or challenge decisions, which can lead to misplaced sanctions or misguided system changes. Studies by Madsen et al. (2024) and Mayne et al. (2025) indicate that larger LLMs may produce slightly more faithful explanations, but that faithfulness varies widely and that self-generated counterfactual explanations can be misleading. Consequently, LLMs cannot currently be relied on to provide the faithful explanations that robust accountability frameworks require.
Key takeaway
If you are an AI Scientist or Research Scientist developing or deploying AI systems in high-stakes contexts, recognize that current LLMs cannot reliably provide faithful explanations. This limitation directly affects your ability to ensure process accountability and to diagnose system errors. Prioritize developing or integrating methods that rigorously measure explanation faithfulness, and consider inherently interpretable models where accountability is paramount, so that you avoid misleading insights and potential regulatory challenges.
Key insights
LLMs struggle to provide faithful explanations, hindering accountability in high-stakes AI applications.
Principles
- Faithful explanations accurately represent model reasoning.
- Unfaithful explanations undermine accountability.
- Interpretable models can yield more faithful explanations.
In practice
- Assess explanation faithfulness for high-stakes AI.
- Develop benchmarks for explanation faithfulness.
- Consider interpretable models for critical decisions.
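As one illustration of the first point, here is a minimal sketch of a validity check for self-generated counterfactual explanations: if a model claims that a modified input would change its prediction to a given label, re-run the model on that input and see whether it does. This is not the procedure used in the cited studies. The function `counterfactual_validity`, the `Case` tuple layout, and the keyword-based `toy_predict` classifier are all hypothetical stand-ins; in practice `predict` would wrap the real model.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical case layout: (original input, counterfactual proposed by the
# model, label the model claims the counterfactual would receive).
Case = Tuple[str, str, str]


def counterfactual_validity(predict: Callable[[str], str],
                            cases: Iterable[Case]) -> float:
    """Fraction of cases where the model's own prediction on its proposed
    counterfactual matches the label it claimed the edit would produce."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    hits = 0
    for original, counterfactual, claimed_label in cases:
        # A counterfactual must target a label different from the original
        # prediction; otherwise it explains nothing and counts as a miss.
        if predict(original) == claimed_label:
            continue
        if predict(counterfactual) == claimed_label:
            hits += 1
    return hits / len(cases)


def toy_predict(text: str) -> str:
    # Assumption: a stand-in classifier used only to make the sketch runnable.
    return "deny" if "late payment" in text.lower() else "approve"


cases = [
    # The proposed edit removes the feature the model actually relies on.
    ("Applicant has one late payment.",
     "Applicant has no missed payments.", "approve"),
    # The proposed edit leaves that feature in place, so the claim fails.
    ("Applicant has one late payment.",
     "Applicant has one late payment and a new job.", "approve"),
]

score = counterfactual_validity(toy_predict, cases)
print(f"counterfactual validity: {score:.2f}")
```

A low score suggests that the model's stated counterfactuals do not track its actual behavior. A high score is only weak evidence of faithfulness, since it tests a single property of the explanation.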
Topics
- Explanation Faithfulness
- Large Language Models
- AI Accountability
- Interpretable AI
- Counterfactual Explanations
Best for: AI Scientist, Research Scientist, AI Researcher, AI Ethicist, Policy Maker
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Accountability Review.