Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Clinical AI Systems · Depth: Expert, extended

Summary

ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis) is a hybrid AI system for pediatric appendicitis risk stratification. It uses a large language model (LLM) as an interface that extracts schema-constrained features from free-text clinical narratives; those features are then passed to an XGBoost classifier for risk prediction. Evaluated on two independent German pediatric cohorts (Regensburg, n=782; Düsseldorf, n=301), ClaMPAPP outperformed several end-to-end LLM baselines. Internally it reached 85.1% accuracy and an F1 score of 84.8% at 97.7% sensitivity; externally it reached 80.7% accuracy and an F1 score of 88.1% at 93.5% sensitivity, with few missed appendicitis cases (FN=2 internally, FN=15 externally). The system was also robust to narrative reordering, a known vulnerability of standalone LLMs, which often degraded substantially (e.g., Llama-3.1-8b accuracy dropped to 47.5% on permuted data). The architecture prioritizes safety and auditability by separating natural-language processing from core predictive inference.
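
The summary reports accuracy, F1, sensitivity, and false-negative counts side by side, so it may help to see how they relate. The sketch below computes them from a binary confusion matrix with appendicitis as the positive class. The `Confusion` class and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts, with appendicitis as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int

    def sensitivity(self) -> float:
        # Share of true appendicitis cases that were flagged; every
        # additional false negative lowers this value.
        return self.tp / (self.tp + self.fn)

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def f1(self) -> float:
        # Harmonic mean of precision and sensitivity, written in count form.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Hypothetical counts for illustration only; NOT figures from the paper.
example = Confusion(tp=90, fp=20, tn=80, fn=10)
print(f"sensitivity={example.sensitivity():.3f}")
print(f"accuracy={example.accuracy():.3f}")
print(f"f1={example.f1():.3f}")
```

Because sensitivity depends only on true positives and false negatives, a triage system can trade some accuracy for fewer missed cases, which is the trade-off the reported figures reflect.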

Key takeaway

Machine learning engineers building clinical decision support systems should prioritize hybrid architectures that separate natural-language processing from predictive inference. Use LLMs for structured feature extraction from clinical notes, but delegate the final risk prediction to a validated ML model such as XGBoost, with a deterministic validation layer in between. This approach reduces false negatives, improves auditability, and increases robustness to input variation, offering a safer path to deploying AI in high-stakes medical triage.

Key insights

Hybrid AI systems improve clinical decision support by using LLMs for feature extraction and validated ML models for reliable, auditable prediction.

Principles

LLMs are unreliable as standalone diagnostic engines in critical settings.
Separating the language interface from prediction enhances safety and auditability.
Deterministic validation layers improve extracted data quality before inference.
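
A deterministic validation layer of the kind named in the last principle could look roughly like the sketch below. It assumes the extractor returns a flat dictionary. The feature names, the plausibility ranges, and the `validate` function are hypothetical illustrations, not the schema used by ClaMPAPP.

```python
# Hypothetical schema: feature name -> (min, max) plausible range.
# Names and bounds are illustrative assumptions, not the paper's schema.
RANGES = {
    "age_years": (0.0, 18.0),
    "body_temp_c": (34.0, 42.0),
    "wbc_count_g_per_l": (0.5, 60.0),
    "crp_mg_per_l": (0.0, 500.0),
}


def validate(extracted: dict) -> tuple[dict, list[str]]:
    """Return (clean_features, issues); the same input always gives the same output."""
    clean = {}
    issues = []
    for name, (low, high) in RANGES.items():
        raw = extracted.get(name)
        if raw is None:
            clean[name] = None  # value not stated in the note
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            clean[name] = None
            issues.append(f"{name}: not numeric ({raw!r})")
            continue
        if low <= value <= high:
            clean[name] = value
        else:
            # Implausible values are treated as missing rather than passed on.
            clean[name] = None
            issues.append(f"{name}: {value} outside [{low}, {high}]")
    # Keys the schema does not know are reported and never forwarded.
    for name in sorted(set(extracted) - set(RANGES)):
        issues.append(f"{name}: not in schema")
    return clean, issues


clean, issues = validate(
    {"age_years": 9, "body_temp_c": "38.6", "wbc_count_g_per_l": 1450, "heart_rate": 110}
)
print(clean)
print(issues)
```

Turning implausible values into missing values, instead of clipping them, keeps the downstream model from acting on a number the note never supported, and the `issues` list gives reviewers an audit trail.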

Method

ClaMPAPP's workflow involves LLM-based feature extraction from narratives, deterministic validation of extracted values, XGBoost risk prediction, and structured clinician-facing report generation.
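
The four stages can be sketched as a chain of small functions. Everything here is a stand-in: `stub_extract` uses regexes in place of the LLM call, `stub_predict` is a toy additive score in place of the trained XGBoost classifier, and the feature names and bounds are hypothetical.

```python
import re
from typing import Optional

Features = dict[str, Optional[float]]

# Hypothetical plausibility bounds; not the paper's schema.
BOUNDS = {"body_temp_c": (34.0, 42.0), "wbc_g_per_l": (0.5, 60.0)}


def stub_extract(narrative: str) -> Features:
    """Stand-in for the LLM extraction call (regexes, no model involved)."""
    patterns = {
        "body_temp_c": r"temp\s+(\d+(?:\.\d+)?)",
        "wbc_g_per_l": r"wbc\s+(\d+(?:\.\d+)?)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, narrative, flags=re.IGNORECASE)
        found[name] = float(match.group(1)) if match else None
    return found


def stub_validate(raw: Features) -> Features:
    """Deterministic check: out-of-range values become missing (None)."""
    clean = {}
    for name, (low, high) in BOUNDS.items():
        value = raw.get(name)
        clean[name] = value if value is not None and low <= value <= high else None
    return clean


def stub_predict(features: Features) -> float:
    """Toy additive score standing in for the trained XGBoost classifier."""
    risk = 0.2
    if (features["body_temp_c"] or 0.0) >= 38.0:
        risk += 0.3
    if (features["wbc_g_per_l"] or 0.0) >= 12.0:
        risk += 0.4
    return round(risk, 2)


def run_pipeline(narrative: str) -> dict:
    """Extract, validate, predict, then assemble a structured report."""
    features = stub_validate(stub_extract(narrative))
    return {
        "features": features,
        "missing": sorted(k for k, v in features.items() if v is None),
        "risk": stub_predict(features),
    }


original = "Temp 38.6 on arrival. WBC 16.2. Pain migrated to the right lower quadrant."
reordered = "Pain migrated to the right lower quadrant. WBC 16.2. Temp 38.6 on arrival."
report = run_pipeline(original)
print(report)
print(run_pipeline(reordered) == report)
```

The predictor only ever sees the validated feature dictionary, so sentence order can affect the result only through the extraction step. The two reports match here because the regex stub is order-independent by construction; a real LLM extractor would need to be tested for that property.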

In practice

Use LLMs for schema-constrained feature parsing from clinical notes.
Integrate deterministic validation for extracted clinical data ranges.
Employ XGBoost for robust, missing-value-aware risk prediction.
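
For the last item, one way to hand validated features to XGBoost is to encode absent values as NaN, which XGBoost treats as missing and routes along a learned default direction at each split. The sketch below is a minimal illustration: the feature names, the `to_row` and `fit_risk_model` helpers, and the hyperparameters are assumptions, not the configuration reported for ClaMPAPP. `numpy` and `xgboost` are imported inside `fit_risk_model` so that the row-building helper works without them.

```python
import math

# Hypothetical feature order; the real schema is not given in this summary.
FEATURE_ORDER = ("age_years", "body_temp_c", "wbc_count_g_per_l", "crp_mg_per_l")


def to_row(features: dict) -> list[float]:
    """Fixed column order; None or absent values become NaN (missing)."""
    return [
        math.nan if features.get(name) is None else float(features[name])
        for name in FEATURE_ORDER
    ]


def fit_risk_model(rows: list[list[float]], labels: list[int]):
    """Train a small XGBoost classifier. Requires numpy and xgboost."""
    import numpy as np
    from xgboost import XGBClassifier

    X = np.asarray(rows, dtype=float)
    y = np.asarray(labels, dtype=int)
    # Illustrative hyperparameters; NaN entries are handled natively.
    model = XGBClassifier(
        n_estimators=50, max_depth=3, learning_rate=0.1, missing=np.nan
    )
    model.fit(X, y)
    return model


row = to_row({"age_years": 9, "wbc_count_g_per_l": 16.2, "crp_mg_per_l": None})
print(row)
```

With a fitted model, `model.predict_proba(np.asarray([row]))[:, 1]` would give the predicted probability of the positive class for one patient. No imputation step is needed, so a value the note never mentioned stays visibly missing all the way to the prediction.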

Topics

Pediatric Appendicitis
Hybrid AI Systems
Large Language Models
Clinical Decision Support
XGBoost Classifier
Feature Extraction

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.