Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM

2026-04-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Clinical Natural Language Processing · Depth: Expert, extended

Summary

A new automated system detects dosing errors in unstructured clinical trial narratives, achieving a test ROC-AUC of 0.8725 on the CT-DEB benchmark dataset. The system uses a LightGBM model trained on 3,451 multi-modal features: traditional NLP features (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3). Features are extracted from nine complementary text fields across 42,112 narratives, of which only 4.9% are positive, a severe class imbalance. Ablation studies identified sentence embeddings as critical: removing them caused a 2.39% performance drop. Feature efficiency analysis found that keeping only the top 500-1000 features gives the best performance (0.886-0.887 AUC), outperforming the full 3,451-feature set by reducing noise.
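
To make the imbalance concrete, the sketch below derives approximate class counts from the two figures quoted above. The exact counts are not given in this summary, so the values are estimates from a rounded rate. Using the negative-to-positive ratio as a class weight (for example through LightGBM's scale_pos_weight parameter) is one common option, not something this summary says the authors did.

```python
# Figures quoted in the summary above.
n_narratives = 42_112
positive_rate = 0.049  # reported as 4.9%, so already rounded

# Estimated counts (assumption: derived from the rounded rate, not reported).
n_pos = round(n_narratives * positive_rate)
n_neg = n_narratives - n_pos

# Negative-to-positive ratio, a common starting value for a positive-class weight.
neg_pos_ratio = n_neg / n_pos

print(f"positives ~ {n_pos}, negatives ~ {n_neg}, ratio ~ {neg_pos_ratio:.1f}")
```

Under these assumptions there are roughly 2,063 positive narratives against about 40,049 negatives, a ratio near 19 to 1.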

Key takeaway

For NLP engineers building clinical trial quality assurance systems, this research shows that hybrid feature engineering, which combines traditional NLP features with dense embeddings and aggressive feature selection, can improve automated dosing error detection. Consider a similar multi-modal feature set with rigorously tuned feature selection to improve both accuracy and computational efficiency, especially on highly imbalanced clinical text.

Key insights

Combining sparse lexical features with dense semantic embeddings improves clinical dosing error detection.
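
A minimal sketch of what combining the two feature families can look like, using only the Python standard library. The example sentences, the 3-character n-gram size, and the 4-dimensional stand-in embeddings are all hypothetical; a real all-MiniLM-L6-v2 embedding has 384 dimensions, and the paper's actual pipeline may differ.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fit_idf(docs, n=3):
    """Smoothed inverse document frequency for each n-gram seen in docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in doc_freq.items()}

def tfidf_vector(doc, idf, vocab, n=3):
    """TF-IDF weight of each vocabulary n-gram (0.0 where the n-gram is absent)."""
    tf = Counter(char_ngrams(doc, n))
    return [tf[g] * idf[g] for g in vocab]

def combine(lexical_part, dense_part):
    """Concatenate lexical and embedding features into one feature row."""
    return list(lexical_part) + list(dense_part)

# Hypothetical narratives, for illustration only.
docs = ["Patient received 10 mg twice daily", "Subject took double dose in error"]
idf = fit_idf(docs)
vocab = sorted(idf)

# Hypothetical 4-dimensional stand-ins for sentence embeddings.
fake_embeddings = [[0.1, -0.2, 0.3, 0.0], [0.4, 0.1, -0.1, 0.2]]

rows = [combine(tfidf_vector(d, idf, vocab), e) for d, e in zip(docs, fake_embeddings)]
print(len(vocab), len(rows[0]))
```

The lexical columns are mostly zeros and capture exact surface forms such as dose strings, while the trailing dense columns carry the semantic signal; a tree model sees both in the same row.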

Principles

Feature selection acts as regularization, improving model performance.
Sparse and dense features offer complementary signals in specialized text.
ROC-AUC is threshold-independent and does not shift with the class ratio, which suits imbalanced classification tasks.
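
The ROC-AUC principle can be illustrated with its pairwise definition: the probability that a randomly chosen positive scores above a randomly chosen negative. The sketch below is a plain-Python illustration with made-up labels and scores, not the paper's evaluation code. It compares every positive with every negative, so it is only suitable for small examples; a sort-based routine such as scikit-learn's roc_auc_score is the usual choice at scale.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only.
labels = [0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.5, 0.05]
auc = roc_auc(labels, scores)

# Repeating every negative ten times changes the class ratio but not the AUC.
neg_scores = [s for y, s in zip(labels, scores) if y == 0]
auc_more_negatives = roc_auc(
    labels + [0] * (9 * len(neg_scores)),
    scores + neg_scores * 9,
)
print(auc, auc_more_negatives)
```

Both values come out to 11/12, because only the relative ordering of positives and negatives matters, not how many of each there are.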

Method

A LightGBM model is trained on 3,451 multi-modal features, including TF-IDF, character n-grams, all-MiniLM-L6-v2 embeddings, medical patterns, and transformer scores, with Optuna-optimized hyperparameters and 5-fold ensemble averaging.
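
The 5-fold ensemble averaging step can be sketched as follows. This is an illustration under stated assumptions: the fold assignment is a simple stratified scheme written here for clarity, and the per-model scores are hypothetical numbers standing in for the predictions of five trained LightGBM models. The summary does not describe how the authors built their folds.

```python
import random
from statistics import mean

def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds with a similar class mix in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across folds.
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

def average_fold_scores(per_model_scores):
    """Average the scores that each fold model assigns to the same test rows."""
    return [mean(column) for column in zip(*per_model_scores)]

# Toy labels with a 5% positive rate, close to the imbalance described above.
labels = [1] * 5 + [0] * 95
folds = stratified_folds(labels, k=5)

# Hypothetical scores from five fold models on three held-out narratives.
per_model_scores = [
    [0.10, 0.80, 0.30],
    [0.20, 0.70, 0.30],
    [0.10, 0.90, 0.20],
    [0.20, 0.80, 0.40],
    [0.15, 0.80, 0.30],
]
ensemble = average_fold_scores(per_model_scores)
print(ensemble)
```

Stratifying matters with a positive rate this low, since a purely random split can leave a fold with very few positives. Averaging the five models' scores then smooths out the variance that comes from each model seeing a slightly different training subset.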

In practice

Use LightGBM for high-dimensional, sparse clinical text data.
Prioritize sentence embeddings for semantic capture.
Apply feature selection to optimize accuracy and efficiency.
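
The feature selection advice above can be sketched as a top-k filter over importance scores. The importance values here are hypothetical, and this summary does not state which ranking criterion the authors used; in a LightGBM workflow the scores could come from a trained booster's feature_importance(importance_type="gain") call.

```python
def top_k_indices(importances, k):
    """Indices of the k largest importances, highest first.

    sorted() is stable, so tied features keep their original column order.
    """
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    return order[:k]

def select_columns(rows, keep):
    """Project each feature row onto the kept column indices."""
    return [[row[i] for i in keep] for row in rows]

# Hypothetical importances for five features, for illustration only.
importances = [0.0, 12.5, 3.1, 0.2, 7.8]
rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

keep = top_k_indices(importances, k=3)
reduced = select_columns(rows, keep)
print(keep, reduced)
```

To keep the evaluation honest, the ranking should be computed on training folds only and then applied unchanged to validation and test data; ranking features on data that is later used for scoring leaks information into the reported AUC.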

Topics

Dosing Error Detection
Clinical Trial Narratives
Multi-Modal Feature Engineering
LightGBM
Sentence Embeddings

Code references

msmadi/Clinical-Trial-Dosing-Error

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.