L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Health & Medical Research · Depth: Expert, extended

Summary

L2D-Clinical is a framework for clinical text classification that adaptively selects between a specialized fine-tuned BERT model and a general-purpose Large Language Model (LLM), using uncertainty signals and text characteristics to decide. Prior Learning to Defer (L2D) approaches assume the deferral target is a human expert who is universally superior to the model. L2D-Clinical instead enables AI-to-AI deferral, which can help even when the LLM is weaker overall than BERT, because the two models make complementary errors. The framework was evaluated on two English clinical tasks: Adverse Drug Event (ADE) detection on ADE Corpus V2 and treatment outcome classification on MIMIC-IV. On ADE, it achieved F1=0.928 (+1.7 points over BioBERT's 0.911) while deferring only 7% of instances. On MIMIC-IV, it reached F1=0.980 (+9.3 points over ClinicalBERT's 0.887) while deferring 16.8% of cases to GPT-5-nano. The system also reduces LLM API costs by 81-93%.
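
As a rough sanity check on the cost figures, the sketch below computes how many LLM calls are avoided at the two reported deferral rates, compared with sending every input to the LLM. It assumes cost scales with the number of LLM calls, which is an assumption of this example and not something the summary states. The reported 81-93% range may rest on a different accounting (for example, token-based pricing). The function name `llm_call_reduction` is hypothetical.

```python
def llm_call_reduction(deferral_rate: float) -> float:
    """Fraction of LLM calls avoided versus routing every input to the LLM.

    Assumes a flat per-call cost (an assumption of this sketch).
    """
    if not 0.0 <= deferral_rate <= 1.0:
        raise ValueError("deferral_rate must be between 0 and 1")
    return 1.0 - deferral_rate


# Deferral rates quoted in the summary above.
for task, rate in [("ADE", 0.07), ("MIMIC-IV", 0.168)]:
    print(f"{task}: {llm_call_reduction(rate):.1%} fewer LLM calls")
```

Under this simple model, a 7% deferral rate corresponds to 93% fewer calls and a 16.8% rate to about 83% fewer calls.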

Key takeaway

For AI engineers building clinical NLP systems, L2D-Clinical offers a way to improve accuracy while controlling cost. Implement an adaptive deferral mechanism that routes each input to either a specialized BERT model or an LLM, based on a learned estimate of BERT's uncertainty and on text characteristics. This lets you reserve the LLM for the cases where BERT is likely to err, keeping API expenses low without sacrificing performance in production.

Key insights

Adaptive AI-to-AI deferral improves clinical text classification by leveraging complementary model strengths and reducing LLM costs.

Method

A logistic regression deferral model uses BERT's softmax probabilities and text features to predict when BERT will err, routing inputs to an LLM if the error probability exceeds a tuned threshold.
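
The routing idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The feature set (maximum softmax probability, entropy, scaled word count), the synthetic training data, the hand-written gradient descent fit, and the 0.5 default threshold are all hypothetical choices made for the example. In practice the deferral model would be fit on held-out data labeled by whether BERT's prediction was correct, and the threshold would be tuned on a validation set.

```python
import math
import random


def features(probs, text):
    """Hypothetical features: bias, max softmax prob, entropy, scaled word count."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return [1.0, max(probs), entropy, min(len(text.split()) / 100.0, 1.0)]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_deferral_model(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights with plain batch gradient descent."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def error_probability(w, probs, text):
    """Estimated probability that BERT's prediction on this input is wrong."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(probs, text))))


def route(w, probs, text, threshold=0.5):
    """Send the input to the LLM when the estimated BERT error risk is high."""
    return "llm" if error_probability(w, probs, text) > threshold else "bert"


# Synthetic training set (illustrative only): BERT errs more often when its
# top softmax probability is low. Label 1 means "BERT was wrong".
random.seed(0)
X, y = [], []
for _ in range(400):
    conf = random.uniform(0.5, 1.0)
    text = "word " * random.randint(5, 80)
    bert_wrong = 1 if random.random() < 2.0 * (1.0 - conf) else 0
    X.append(features([conf, 1.0 - conf], text))
    y.append(bert_wrong)

w = fit_deferral_model(X, y)
print(route(w, [0.55, 0.45], "patient developed rash after starting drug"))
print(route(w, [0.99, 0.01], "patient developed rash after starting drug"))
```

Raising the threshold defers fewer inputs and lowers LLM cost. Lowering it defers more and leans harder on the LLM, which only pays off where the LLM's errors are complementary to BERT's.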

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.