Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis
Summary
ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis) is a hybrid AI system for pediatric appendicitis risk stratification. It uses a large language model (LLM) as an interface that extracts schema-constrained features from free-text clinical narratives; those features are then passed to an XGBoost classifier for risk prediction. Evaluated on two independent German pediatric cohorts (Regensburg, n=782; Düsseldorf, n=301), ClaMPAPP outperformed several end-to-end LLM baselines. Internally it reached 85.1% accuracy, an F1 score of 84.8%, and 97.7% sensitivity; externally it reached 80.7% accuracy, an F1 score of 88.1%, and 93.5% sensitivity, keeping missed appendicitis cases low (FN=2 internally, FN=15 externally). The system was also robust to narrative reordering, a known vulnerability of standalone LLMs, which often degraded substantially (e.g., Llama-3.1-8b accuracy dropped to 47.5% on permuted data). The architecture prioritizes safety and auditability by separating natural-language processing from core predictive inference.
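The reordering check mentioned above can be illustrated with a minimal sketch. It assumes that "narrative reordering" means shuffling sentences within a note, which is one plausible reading rather than the paper's confirmed protocol. The function names, the toy predictor, and the example notes are all hypothetical.

```python
import random
import re
from typing import Callable, List


def permute_sentences(narrative: str, seed: int) -> str:
    """Return the narrative with its sentences in a shuffled order."""
    # Naive split after sentence-ending punctuation (an assumption);
    # real clinical text would need a more careful sentence splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", narrative) if s.strip()]
    rng = random.Random(seed)  # seeded so the permutation is reproducible
    rng.shuffle(sentences)
    return " ".join(sentences)


def prediction_stability(
    predict: Callable[[str], int],
    narratives: List[str],
    n_permutations: int = 5,
) -> float:
    """Fraction of permuted narratives whose label matches the unpermuted one."""
    unchanged = 0
    total = 0
    for text in narratives:
        baseline = predict(text)
        for seed in range(n_permutations):
            unchanged += int(predict(permute_sentences(text, seed)) == baseline)
            total += 1
    return unchanged / total if total else 1.0


def keyword_predictor(text: str) -> int:
    """Hypothetical stand-in model: order-insensitive keyword rule."""
    lowered = text.lower()
    return int("right lower quadrant" in lowered and "fever" in lowered)


# Invented example notes, not taken from either cohort.
notes = [
    "Pain began yesterday. Fever of 38.6 C. Tenderness in the right lower quadrant.",
    "No fever. Mild diffuse abdominal discomfort. Normal appetite.",
]
stability = prediction_stability(keyword_predictor, notes)
```

A model that only reads a set of extracted features should score close to 1.0 on such a check, whereas a model that conditions on token order can drift; swapping `keyword_predictor` for a real model call is left to the reader.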
Key takeaway
Machine learning engineers building clinical decision support systems should prioritize hybrid architectures that separate natural language processing from predictive inference. Use LLMs for structured feature extraction from clinical notes, but delegate the final risk prediction to a validated ML model such as XGBoost, with a deterministic validation layer in between. This approach reduces false negatives, improves auditability and robustness to input variation, and offers a safer path to deploying AI in high-stakes medical triage.
Key insights
Hybrid AI systems improve clinical decision support by using LLMs for feature extraction and validated ML models for reliable, auditable prediction.
Principles
- LLMs are unreliable as standalone diagnostic engines in critical settings.
- Separating the language interface from prediction enhances safety and auditability.
- Deterministic validation layers improve extracted data quality before inference.
Method
ClaMPAPP's workflow has four stages: LLM-based feature extraction from narratives, deterministic validation of the extracted values, XGBoost risk prediction, and generation of a structured clinician-facing report.
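The four stages can be sketched as a small orchestration function. This is an illustrative outline under stated assumptions, not the authors' implementation: the feature names, the temperature range, the report layout, and the three stub functions are hypothetical, and the stub predictor stands in for the trained XGBoost model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, Optional[float]]  # None marks a missing value


@dataclass
class TriageReport:
    risk: float
    features: Features
    warnings: List[str]

    def render(self) -> str:
        """Plain-text report for clinicians (layout is an assumption)."""
        lines = [f"Estimated appendicitis risk: {self.risk:.0%}"]
        for name, value in self.features.items():
            lines.append(f"  {name}: {'missing' if value is None else value}")
        lines += [f"  WARNING: {w}" for w in self.warnings]
        return "\n".join(lines)


def run_pipeline(
    narrative: str,
    extract: Callable[[str], Features],
    validate: Callable[[Features], Tuple[Features, List[str]]],
    predict: Callable[[Features], float],
) -> TriageReport:
    raw = extract(narrative)          # 1. LLM acts only as an interface
    clean, warnings = validate(raw)   # 2. deterministic checks, no model involved
    risk = predict(clean)             # 3. classifier makes the prediction
    return TriageReport(risk, clean, warnings)  # 4. structured report


# Hypothetical stand-ins so the sketch runs without an LLM or a trained model.
def stub_extract(narrative: str) -> Features:
    return {"age_years": 9.0, "temperature_c": 386.0}  # simulated extraction error


def stub_validate(raw: Features) -> Tuple[Features, List[str]]:
    clean, warnings = dict(raw), []
    temp = clean.get("temperature_c")
    if temp is not None and not 30.0 <= temp <= 43.0:  # assumed plausible range
        clean["temperature_c"] = None
        warnings.append(f"temperature_c={temp} outside 30-43, set to missing")
    return clean, warnings


def stub_predict(features: Features) -> float:
    return 0.5  # placeholder score, not a real model output


report = run_pipeline(
    "9-year-old with abdominal pain, temp 386.",
    stub_extract, stub_validate, stub_predict,
)
print(report.render())
```

Passing each stage in as a callable keeps the language model away from the prediction step, so every stage can be logged and tested on its own.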
In practice
- Use LLMs for schema-constrained feature parsing from clinical notes.
- Integrate deterministic validation for extracted clinical data ranges.
- Employ XGBoost for robust, missing-value-aware risk prediction.
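The first two practices above can be sketched with the standard library alone. The schema, field names, and value ranges below are hypothetical placeholders, not the paper's feature set. Values that fail a check become NaN, which XGBoost treats as missing by default, so the resulting vector could be passed to a trained classifier without imputation.

```python
import json
import math
from typing import Dict, List, Tuple

# Hypothetical schema: feature name -> (min, max) plausible range.
SCHEMA: Dict[str, Tuple[float, float]] = {
    "age_years": (0.0, 18.0),
    "temperature_c": (30.0, 43.0),
    "wbc_count_1e9_per_l": (0.0, 100.0),
}


def parse_llm_output(llm_text: str) -> Tuple[List[float], List[str]]:
    """Turn raw LLM output into a fixed-order feature vector plus an issue log."""
    issues: List[str] = []
    try:
        payload = json.loads(llm_text)
    except json.JSONDecodeError:
        payload = {}
        issues.append("output was not valid JSON")
    if not isinstance(payload, dict):
        payload = {}
        issues.append("output was not a JSON object")

    # Anything outside the schema (e.g. a diagnosis guess) is discarded.
    for key in payload:
        if key not in SCHEMA:
            issues.append(f"ignored unexpected field: {key}")

    vector: List[float] = []
    for name, (low, high) in SCHEMA.items():
        value = payload.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            if value is not None:
                issues.append(f"{name}: non-numeric value dropped")
            vector.append(math.nan)  # missing or unusable
        elif not low <= value <= high:
            issues.append(f"{name}={value} outside [{low}, {high}]")
            vector.append(math.nan)  # implausible, treated as missing
        else:
            vector.append(float(value))
    return vector, issues


# Simulated LLM response with one implausible value and one off-schema field.
vector, issues = parse_llm_output(
    '{"age_years": 9, "temperature_c": 386, "diagnosis": "appendicitis"}'
)
```

Because the checks are deterministic, the same LLM output always yields the same vector and the same issue log, which is what makes this layer auditable.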
Topics
- Pediatric Appendicitis
- Hybrid AI Systems
- Large Language Models
- Clinical Decision Support
- XGBoost Classifier
- Feature Extraction
Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.