L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Health & Medical Research · Depth: Expert, extended

Summary

L2D-Clinical is a framework for clinical text classification that adaptively selects between a specialized fine-tuned BERT model and a general-purpose Large Language Model (LLM) based on uncertainty signals and text characteristics. Unlike prior Learning to Defer (L2D) approaches, which defer to human experts assumed to be universally superior, L2D-Clinical performs AI-to-AI deferral and remains useful even when the LLM is weaker overall than BERT. It was evaluated on two English clinical tasks: Adverse Drug Event (ADE) detection (ADE Corpus V2) and treatment outcome classification (MIMIC-IV). On ADE, it achieved F1=0.928 (+1.7 points over BioBERT's 0.911) while deferring only 7% of instances. On MIMIC-IV, it reached F1=0.980 (+9.3 points over ClinicalBERT's 0.887) by deferring 16.8% of cases to GPT-5-nano. The gains come from the complementary error patterns of the two model types, and selective deferral cuts LLM API costs by 81-93%.

Key takeaway

For AI engineers building clinical NLP systems, L2D-Clinical offers a way to raise accuracy while controlling cost. Train a deferral mechanism that routes each input to either a specialized BERT model or an LLM, based on learned uncertainty and text characteristics. Only the cases BERT is likely to get wrong reach the LLM, so you benefit from the LLM on hard inputs while keeping API spend low in production.

Key insights

Adaptive AI-to-AI deferral improves clinical text classification by leveraging complementary model strengths and reducing LLM costs.

Principles

Aggregate F1 scores mask instance-level model variation.
Complementary error profiles enable gains even from weaker models.
Interpretable features drive effective deferral decisions.

Method

A logistic regression deferral model uses BERT's softmax probabilities and text features to predict when BERT will err, routing inputs to an LLM if the error probability exceeds a tuned threshold.
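
The routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the three features (maximum softmax probability, softmax entropy, log token count), the hand-written gradient-descent fit, and the names deferral_features, fit_logistic, error_probability, classify, and llm_classify are all hypothetical. In practice a library logistic regression would likely replace fit_logistic, and the threshold would be tuned on held-out data.

```python
import math

def deferral_features(probs, text):
    """Illustrative features (assumed): BERT confidence, softmax entropy, log token count."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return [max(probs), entropy, math.log1p(len(text.split()))]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log loss; y[i] is 1 where BERT was wrong."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def error_probability(model, probs, text):
    """Predicted probability that BERT's label for this instance is wrong."""
    w, b = model
    x = deferral_features(probs, text)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def classify(model, probs, text, llm_classify, threshold=0.5):
    """Return (label, deferred). Defers to the LLM only when error risk exceeds the threshold."""
    if error_probability(model, probs, text) > threshold:
        return llm_classify(text), True
    bert_label = max(range(len(probs)), key=lambda i: probs[i])
    return bert_label, False
```

Here llm_classify stands in for whatever callable wraps the LLM API. Raising the threshold defers fewer instances and lowers API cost; lowering it sends more borderline cases to the LLM.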

In practice

Use L2D for cost-effective LLM integration.
Prioritize BERT's uncertainty and text features for deferral.
Employ multi-LLM consensus for high-quality ground truth.
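
The summary does not say how the multi-LLM consensus is computed. One simple reading is a strict majority vote over labels from several LLMs, with instances that lack a majority set aside for review. The sketch below assumes exactly that; consensus_label and the example label strings are hypothetical.

```python
from collections import Counter

def consensus_label(votes):
    """Return the label chosen by a strict majority of LLM votes, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes must agree.
    return label if count * 2 > len(votes) else None
```

Instances that return None have no clear consensus and are candidates for manual review or exclusion from the reference set.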

Topics

Learning to Defer (L2D)
Clinical Text Classification
Hybrid AI Systems
BERT Models
Large Language Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.