MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
Summary
MADE is a new, continuously updated multi-label text classification (MLTC) benchmark designed for medical device adverse event reports, addressing challenges like label imbalances, dependencies, and combinatorial complexity in high-stakes healthcare AI. It features a long-tailed distribution of hierarchical labels and uses strict temporal splits to prevent training data contamination and ensure reproducible evaluation. The benchmark establishes baselines across over 20 encoder- and decoder-only models, including fine-tuned and few-shot instruction-tuned/reasoning variants. It also systematically assesses entropy-based, consistency-based, and self-verbalized uncertainty quantification (UQ) methods. Key findings indicate that smaller discriminatively fine-tuned decoders offer strong accuracy and competitive UQ, while generative fine-tuning provides the most reliable UQ.
Key takeaway
If you are an NLP engineer developing MLTC systems in high-stakes domains like healthcare, consider MADE for benchmarking. Its continuous updates and temporal splits offer a robust evaluation environment, helping you distinguish genuine model capability from memorization. Prioritize smaller discriminatively fine-tuned decoders for strong accuracy with competitive UQ, or generative fine-tuning if reliable UQ is your primary concern, especially for rare labels.
Key insights
MADE is a living MLTC benchmark for medical device adverse events, emphasizing reliable uncertainty quantification.
Principles
- Continuous updates prevent data contamination.
- Temporal splits ensure reproducible evaluation.
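The temporal-split principle can be illustrated with a minimal sketch. This is not MADE's actual pipeline: the record fields (`id`, `date`, `labels`), the cutoff dates, and the function name `temporal_split` are all hypothetical, chosen only to show how every training report can be made to predate every evaluation report.

```python
from datetime import date

def temporal_split(reports, train_end, val_end):
    """Partition reports by date so that train <= train_end < val <= val_end < test.

    Field names and cutoffs are illustrative assumptions, not MADE's schema.
    """
    train, val, test = [], [], []
    for report in reports:
        d = report["date"]
        if d <= train_end:
            train.append(report)
        elif d <= val_end:
            val.append(report)
        else:
            test.append(report)
    return train, val, test

# Toy records standing in for adverse event reports (made-up data).
reports = [
    {"id": 1, "date": date(2021, 3, 1), "labels": ["malfunction"]},
    {"id": 2, "date": date(2022, 7, 15), "labels": ["injury", "malfunction"]},
    {"id": 3, "date": date(2023, 1, 10), "labels": ["injury"]},
    {"id": 4, "date": date(2023, 9, 30), "labels": ["malfunction"]},
]

train, val, test = temporal_split(
    reports, train_end=date(2021, 12, 31), val_end=date(2022, 12, 31)
)
```

Because the split depends only on fixed cutoff dates, rerunning it on the same data reproduces the same partition, and reports published after a model's training cutoff can only ever land in the later sets.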
Method
MADE establishes baselines for more than 20 encoder-only and decoder-only models, covering fine-tuned and few-shot instruction-tuned or reasoning variants, and systematically assesses entropy-based, consistency-based, and self-verbalized UQ methods.
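To make two of these UQ families concrete, here is a minimal sketch using generic textbook formulations, which may differ from the exact definitions used in the benchmark. It assumes that per-label probabilities are available for the entropy-based score and that the model can be sampled several times for the consistency-based score; the function names are hypothetical.

```python
import math
from collections import Counter

def binary_entropy(p):
    """Entropy in bits of one label's predicted probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def mean_label_entropy(probs):
    """Entropy-based score: average per-label binary entropy for one document."""
    return sum(binary_entropy(p) for p in probs) / len(probs)

def label_consistency(samples):
    """Consistency-based score: fraction of sampled predictions containing each label.

    `samples` is a list of predicted label sets, one per sampled generation.
    """
    counts = Counter(label for sample in samples for label in set(sample))
    n = len(samples)
    return {label: count / n for label, count in counts.items()}

# Assumed toy inputs: probabilities for three labels, and four sampled label sets.
entropy_score = mean_label_entropy([0.5, 1.0, 0.0])
agreement = label_consistency([["a", "b"], ["a"], ["a", "c"], ["a", "b"]])
```

Higher mean entropy signals more uncertainty about the label set, while a label that appears in only a small fraction of samples is one the model predicts inconsistently. Self-verbalized UQ, where the model states its own confidence in text, is not sketched here because it depends on prompt design.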
In practice
- Smaller discriminatively fine-tuned decoders balance accuracy and UQ.
- Generative fine-tuning yields the most reliable UQ.
Topics
- MADE Benchmark
- Multi-Label Text Classification
- Uncertainty Quantification
- Medical Device Adverse Events
- Healthcare Machine Learning
Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.