Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
Summary
This systematic literature review, covering research from January 2020 to early 2024, analyzes the adoption of explainability in multimodal attention-based models. It reveals that most studies focus on vision-language and language-only models, primarily employing attention-based techniques for explanation. A critical finding is the non-systematic nature of XAI evaluation methods in multimodal settings, which often lack consistency, robustness, and consideration for modality-specific cognitive factors. The review provides comprehensive recommendations to foster rigorous, transparent, and standardized evaluation and reporting practices, aiming to support the development of more interpretable, accountable, and responsible multimodal AI systems.
Key takeaway
AI Scientists and Research Scientists developing multimodal attention-based models should treat explainability as a core design objective, not an afterthought. Focus on developing and adopting standardized, cognition- and domain-aware evaluation metrics that explicitly quantify inter-modal interactions, moving beyond ad hoc qualitative analyses. This helps ensure that models are not only performant but also transparent, trustworthy, and aligned with increasing regulatory requirements.
Key insights
Multimodal XAI evaluation lacks standardization, hindering interpretable and responsible AI development.
Principles
- Multimodal XAI evaluation is largely non-systematic.
- Attention weights are common but debated XAI tools.
- Explainability should be a core design objective.
Method
The review proposes a comprehensive guideline for advancing standardized practices in incorporating explainability into multimodal models and streamlining their evaluation across tasks and domains.
In practice
- Utilize toolkits like Inseq, VISIT, or VL-InterpreT for visualization.
- Apply positive/negative perturbation analysis for faithfulness.
- Employ segmentation metrics (mIoU, AP, AR) for localization.
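The perturbation and localization checks listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the review's own protocol: `toy_score`, `WEIGHTS`, and every function name here are hypothetical, the "model" is a toy additive scorer rather than a real attention-based network, and mIoU is simplified to a mean over example pairs of binary masks rather than a mean over classes.

```python
from typing import Callable, List, Sequence


def perturbation_curve(
    score_fn: Callable[[List[bool]], float],
    relevance: Sequence[float],
    positive: bool = True,
) -> List[float]:
    """Model score after masking 0, 1, ..., n input units in relevance order.

    positive=True masks the most relevant units first (the score should drop
    quickly if the explanation is faithful); positive=False masks the least
    relevant units first (the score should stay high for as long as possible).
    """
    n = len(relevance)
    order = sorted(range(n), key=lambda i: relevance[i], reverse=positive)
    keep = [True] * n
    scores = [score_fn(list(keep))]
    for idx in order:
        keep[idx] = False
        scores.append(score_fn(list(keep)))
    return scores


def area_under_curve(scores: Sequence[float]) -> float:
    """Trapezoidal area under the curve, with the x-axis normalised to [0, 1]."""
    steps = len(scores) - 1
    if steps <= 0:
        return 0.0
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(steps)) / steps


def mean_iou(
    pred_masks: Sequence[Sequence[int]], true_masks: Sequence[Sequence[int]]
) -> float:
    """Mean intersection-over-union over pairs of flat binary masks."""
    ious = []
    for pred, true in zip(pred_masks, true_masks):
        inter = sum(1 for p, t in zip(pred, true) if p and t)
        union = sum(1 for p, t in zip(pred, true) if p or t)
        # Two empty masks agree perfectly, so count that case as IoU = 1.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


# Hypothetical toy model: the score is the summed weight of the unmasked units.
WEIGHTS = [0.5, 0.3, 0.15, 0.05]


def toy_score(keep: List[bool]) -> float:
    return sum(w for w, k in zip(WEIGHTS, keep) if k)


# Assume an explanation that ranks the units exactly as the model weights them.
relevance = list(WEIGHTS)
pos_auc = area_under_curve(perturbation_curve(toy_score, relevance, positive=True))
neg_auc = area_under_curve(perturbation_curve(toy_score, relevance, positive=False))

# Localization: thresholded relevance maps compared with ground-truth masks.
miou = mean_iou([[1, 1, 0, 0], [0, 1, 1, 0]], [[1, 0, 0, 0], [0, 1, 1, 0]])
```

For a faithful explanation, the positive-perturbation area should be small and the negative-perturbation area large, so the gap between `neg_auc` and `pos_auc` gives a single comparable number across explanation methods. In a multimodal setting the same loop could be run per modality (for example, masking image patches and text tokens separately), though how units are masked is a design choice that the sketch leaves open.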
Topics
- Multimodal Learning
- Explainable AI
- Attention Models
- Vision-Language Models
- Explainability Evaluation
- Transformer Architectures
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.