MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology · Depth: Expert, quick

Summary

MADE is a new, continuously updated multi-label text classification (MLTC) benchmark built from medical device adverse event reports. It targets challenges that matter in high-stakes healthcare AI: label imbalance, label dependencies, and combinatorial complexity. The benchmark features a long-tailed distribution of hierarchical labels and uses strict temporal splits to prevent training data contamination and ensure reproducible evaluation. It establishes baselines for more than 20 encoder- and decoder-only models, including fine-tuned and few-shot instruction-tuned/reasoning variants, and systematically assesses entropy-based, consistency-based, and self-verbalized uncertainty quantification (UQ) methods. Key findings indicate that smaller discriminatively fine-tuned decoders offer strong accuracy and competitive UQ, while generative fine-tuning provides the most reliable UQ.

Key takeaway

If you are an NLP engineer building MLTC systems in high-stakes domains such as healthcare, consider MADE for benchmarking. Its continuous updates and temporal splits offer a robust evaluation environment, helping you distinguish genuine model capabilities from memorization. Prioritize smaller discriminatively fine-tuned decoders for strong accuracy with competitive UQ, or generative fine-tuning if reliable UQ is your primary concern, especially for rare labels.

Key insights

MADE is a living MLTC benchmark for medical device adverse events, emphasizing reliable uncertainty quantification.

Principles

Continuous updates prevent data contamination.
Temporal splits ensure reproducible evaluation.
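
The temporal-split principle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MADE's actual pipeline: the `date` field, the cutoff dates, and the `temporal_split` helper are all hypothetical.

```python
from datetime import date


def temporal_split(reports, train_end, val_end):
    """Partition reports by date (hypothetical helper, not from MADE).

    train: date <= train_end
    val:   train_end < date <= val_end
    test:  date > val_end
    """
    train, val, test = [], [], []
    for report in reports:
        d = report["date"]  # assumed field name
        if d <= train_end:
            train.append(report)
        elif d <= val_end:
            val.append(report)
        else:
            test.append(report)
    return train, val, test


# Toy records with made-up dates, for illustration only.
reports = [
    {"id": 1, "date": date(2022, 5, 1)},
    {"id": 2, "date": date(2023, 6, 1)},
    {"id": 3, "date": date(2024, 2, 1)},
]
train, val, test = temporal_split(
    reports, train_end=date(2022, 12, 31), val_end=date(2023, 12, 31)
)
```

Because every test report postdates every training report, a model cannot have seen test items during fine-tuning on the training split; in a living benchmark, newly added reports would extend the later partitions.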

Method

MADE establishes baselines for more than 20 models (encoder- and decoder-only, fine-tuned and few-shot) and systematically assesses entropy-based, consistency-based, and self-verbalized UQ methods.
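
To make two of these UQ families concrete, here is a generic sketch of an entropy-based score and a consistency-based score for multi-label outputs. It is an assumption-laden illustration, not the paper's exact formulation: averaging per-label binary entropy and using mean pairwise Jaccard similarity are common choices picked here for simplicity, and all function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a single label probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mean_label_entropy(probs):
    """Entropy-based uncertainty: average binary entropy over all labels.

    Assumes `probs` holds one independent (e.g. sigmoid) probability per label.
    Higher values mean the model is less certain.
    """
    return sum(binary_entropy(p) for p in probs) / len(probs)


def consistency_score(sampled_label_sets):
    """Consistency-based confidence: mean pairwise Jaccard similarity
    between label sets from repeated generations of the same input.

    Higher values mean the samples agree more.
    """
    total, pairs = 0.0, 0
    for i in range(len(sampled_label_sets)):
        for j in range(i + 1, len(sampled_label_sets)):
            a, b = set(sampled_label_sets[i]), set(sampled_label_sets[j])
            union = a | b
            # Two empty predictions are treated as fully consistent.
            total += len(a & b) / len(union) if union else 1.0
            pairs += 1
    return total / pairs if pairs else 1.0


# Toy inputs with made-up label names, for illustration only.
uncertainty = mean_label_entropy([0.5, 1.0])
agreement = consistency_score([["leak", "burn"], ["leak", "burn"], ["leak"]])
```

The entropy score needs access to label probabilities, which suits discriminative classifiers; the consistency score needs only sampled outputs, which suits generative models queried several times.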

In practice

Smaller discriminatively fine-tuned decoders balance accuracy and UQ.
Generative fine-tuning yields the most reliable UQ.

Topics

MADE Benchmark
Multi-Label Text Classification
Uncertainty Quantification
Medical Device Adverse Events
Healthcare Machine Learning

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.