L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification
Summary
L2D-Clinical is a framework for clinical text classification that adaptively routes each input to either a specialized fine-tuned BERT model or a general-purpose Large Language Model (LLM), based on uncertainty signals and text characteristics. Prior Learning to Defer (L2D) approaches defer to human experts assumed to be universally superior; L2D-Clinical instead defers from one AI model to another, even when the LLM is weaker overall than BERT. It was evaluated on two English clinical tasks: Adverse Drug Event (ADE) detection (ADE Corpus V2) and treatment outcome classification (MIMIC-IV). On ADE, it achieved F1=0.928 (+1.7 points over BioBERT's 0.911) while deferring only 7% of instances. On MIMIC-IV, it reached F1=0.980 (+9.3 points over ClinicalBERT's 0.887) while deferring 16.8% of cases to GPT-5-nano. The gains come from the two models' complementary error patterns, and selective deferral reduces LLM API costs by 81-93%.
Key takeaway
For AI engineers building clinical NLP systems, L2D-Clinical offers a way to improve accuracy while controlling cost. Rather than sending every input to an LLM, route between a specialized BERT model and the LLM with a learned deferral model driven by BERT's uncertainty and text characteristics. The LLM is then called only for the cases where BERT is likely to err, which keeps API spend low in production.
Key insights
Adaptive AI-to-AI deferral improves clinical text classification by leveraging complementary model strengths and reducing LLM costs.
Principles
- Aggregate F1 scores mask instance-level model variation.
- Complementary error profiles enable gains even from weaker models.
- Interpretable features drive effective deferral decisions.
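The first two principles can be illustrated with a toy calculation. The sketch below uses made-up per-instance correctness flags (not results from the paper) and plain accuracy rather than F1, purely to show how a model that is weaker in aggregate can still fix every error of the stronger one when their mistakes do not overlap.

```python
def accuracy(flags):
    """Fraction of instances marked correct (1 = correct, 0 = wrong)."""
    return sum(flags) / len(flags)

# Hypothetical correctness flags for ten instances; illustrative only.
bert_correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # stronger overall
llm_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # weaker overall

# An ideal router picks, per instance, whichever model is right.
# This is an upper bound; a learned deferral model only approximates it.
oracle_correct = [max(b, l) for b, l in zip(bert_correct, llm_correct)]

print("BERT:  ", accuracy(bert_correct))
print("LLM:   ", accuracy(llm_correct))
print("Oracle:", accuracy(oracle_correct))
```

The aggregate scores alone would suggest never using the LLM; the per-instance view shows where it helps.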
Method
A logistic regression deferral model uses BERT's softmax probabilities and text features to predict when BERT will err, routing inputs to an LLM if the error probability exceeds a tuned threshold.
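A minimal sketch of the method described above, in pure Python. The specific features (maximum softmax probability, entropy, a capped word count), the toy training rows, the hand-written gradient descent, and the 0.5 threshold are all assumptions made for illustration; the summary does not specify the paper's exact feature set or tuned threshold.

```python
import math

def features(softmax_probs, text):
    """Hypothetical feature vector: BERT confidence, entropy, capped length."""
    confidence = max(softmax_probs)
    entropy = -sum(p * math.log(p) for p in softmax_probs if p > 0)
    length = min(len(text.split()) / 100.0, 1.0)
    return [confidence, entropy, length]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def error_probability(w, b, x):
    """Predicted probability that BERT's label is wrong for features x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def route(w, b, softmax_probs, text, threshold=0.5):
    """Defer to the LLM when the predicted error probability exceeds threshold."""
    p_err = error_probability(w, b, features(softmax_probs, text))
    return "llm" if p_err > threshold else "bert"

# Toy training rows: (BERT softmax, text, 1 if BERT was wrong). Illustrative only.
TRAIN = [
    ([0.98, 0.02], "no adverse reaction reported", 0),
    ([0.96, 0.04], "patient tolerated the medication well", 0),
    ([0.97, 0.03], "rash developed after starting amoxicillin", 0),
    ([0.93, 0.07], "discharged home in stable condition", 0),
    ([0.55, 0.45], "symptoms may or may not relate to the recent dose change per family", 1),
    ([0.60, 0.40], "nausea noted, unclear whether due to chemotherapy or underlying infection", 1),
    ([0.52, 0.48], "history of reaction to unknown agent, details unavailable at this time", 1),
    ([0.58, 0.42], "mild dizziness, possibly positional, medication list recently revised twice", 1),
]
X = [features(probs, text) for probs, text, _ in TRAIN]
y = [label for _, _, label in TRAIN]
w, b = fit_logistic(X, y)

print(route(w, b, [0.99, 0.01], "no adverse event"))
print(route(w, b, [0.51, 0.49], "unclear whether the rash relates to the new drug or prior exposure"))
```

In a real system the threshold would be tuned on held-out data, since it directly trades the deferral rate (and therefore LLM cost) against the errors caught.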
In practice
- Use L2D for cost-effective LLM integration.
- Prioritize BERT's uncertainty and text features for deferral.
- Employ multi-LLM consensus for high-quality ground truth.
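The consensus idea in the last bullet could be sketched as a simple vote. This assumes a strict-majority rule with ties discarded; the summary does not say how the paper aggregates LLM labels, and `consensus_label` and `min_agreement` are hypothetical names.

```python
from collections import Counter

def consensus_label(votes, min_agreement=2):
    """Return the label most LLMs agree on, or None if there is no clear winner.

    votes: labels produced by different LLMs for one instance.
    min_agreement: minimum number of matching votes required (an assumption).
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common(2)
    top_label, top_count = ranked[0]
    # Reject ties between the two most frequent labels.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_label if top_count >= min_agreement else None

print(consensus_label(["improved", "improved", "worsened"]))
print(consensus_label(["improved", "worsened"]))
```

Instances that return `None` can be dropped or sent for manual review rather than being used as ground truth.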
Topics
- Learning to Defer (L2D)
- Clinical Text Classification
- Hybrid AI Systems
- BERT Models
- Large Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.