Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Health & Medical Research, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

An ensemble-based framework using Google's Gemini and Gemma large language models (LLMs) has been developed to automate the identification of EQ-5D studies from PubMed abstracts. The multi-phase approach integrates few-shot prompting, weighted ensembling, and a soft stacking meta-classifier. Evaluated on a dataset of 200 manually labeled PubMed studies, the weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a weighted F1-score of 0.74 and an accuracy of 0.74. This exceeded the results of the individual models and improved the balance between precision and recall. The study also analyzed runtime and cost: gemini-2.5-pro had the highest individual performance, while lighter models offered a practical balance of accuracy and cost-effectiveness, with costs ranging from 0.07 to 5.04 USD per run of 200 abstracts.

Key takeaway

For research scientists and ML engineers building automated literature screening tools, ensemble LLM approaches are worth considering for improved accuracy and reliability. Combining predictions from models such as gemini-2.5-pro and gemma-3-12b can give a better balance of precision and recall than a single model. Weigh model performance against inference cost: lighter models can offer acceptable accuracy for scalable, resource-constrained deployments.
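
To make the cost trade-off concrete, the reported range of 0.07 to 5.04 USD per run of 200 abstracts can be projected onto a larger screening job. This is a rough sketch that assumes cost scales linearly with the number of abstracts, which the summary does not state; the function name and the 10,000-abstract corpus size are illustrative.

```python
def projected_cost(cost_per_run_usd: float, abstracts_per_run: int, n_abstracts: int) -> float:
    """Linearly scale a measured per-run cost to a corpus of n_abstracts.

    Assumption: cost grows in proportion to the number of abstracts.
    """
    return cost_per_run_usd / abstracts_per_run * n_abstracts


# Reported range: 0.07 to 5.04 USD per run of 200 abstracts.
low = projected_cost(0.07, 200, 10_000)
high = projected_cost(5.04, 200, 10_000)
print(f"10,000 abstracts: about {low:.2f} to {high:.2f} USD")
```

Under that linear assumption, the gap between the cheapest and the most expensive configuration widens in absolute terms as the corpus grows, which is why the choice of model matters more for large screening jobs.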

Key insights

Ensemble LLM frameworks can automate biomedical text classification while balancing performance and interpretability.

Principles

Ensembling LLMs reduces individual model biases.
Larger LLMs generally outperform smaller ones in biomedical contexts.
Probabilistic outputs are more discriminative than raw confidence scores.

Method

A multi-phase framework combines few-shot prompting, weighted ensemble aggregation based on F1-scores and confidence, and a soft stacking meta-classifier using logistic regression on model probabilities and confidences.
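
The weighted aggregation step can be sketched as follows. The summary says only that weights are based on F1-scores and confidence, so the specific rule here (each model's vote counts as its F1-score times its confidence) is an assumption, not the paper's exact formula. The per-model F1 values, the label names, and the function name `weighted_vote` are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, f1_scores):
    """Pick the label with the largest total of (model F1 * model confidence).

    predictions: {model_name: (label, confidence in [0, 1])}
    f1_scores:   {model_name: validation F1-score}
    """
    totals = defaultdict(float)
    for model, (label, confidence) in predictions.items():
        totals[label] += f1_scores[model] * confidence
    return max(totals, key=totals.get)


# Hypothetical per-model F1-scores (not taken from the paper).
f1_scores = {"gemini-2.5-pro": 0.72, "gemma-3-12b": 0.65, "gemma-3-27b": 0.68}

# Hypothetical predictions for one abstract.
predictions = {
    "gemini-2.5-pro": ("not_eq5d", 0.60),
    "gemma-3-12b": ("eq5d", 0.90),
    "gemma-3-27b": ("eq5d", 0.80),
}

print(weighted_vote(predictions, f1_scores))
```

Multiplying F1 by confidence lets a historically stronger model count for more, while still allowing a confident pair of lighter models to outvote a hesitant larger one.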

In practice

Use few-shot prompting for domain-specific LLM classification.
Combine top-performing LLMs via weighted ensembling for improved F1-score.
Employ soft stacking with logistic regression for enhanced reliability.
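
The soft stacking step in the list above can be illustrated with a small, dependency-free sketch: each base model's predicted probability becomes a feature, and a logistic regression meta-classifier learns how to combine them. The toy data is invented, the features use probabilities only (the paper's meta-classifier also uses confidences), and the hand-written gradient descent is a stand-in for a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example under the current parameters.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Meta-classifier probability that the abstract is an EQ-5D study."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (invented): one row per abstract, one column per base model's
# predicted probability that the abstract is an EQ-5D study.
X = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.7, 0.9, 0.6],
    [0.2, 0.3, 0.1],
    [0.1, 0.4, 0.3],
    [0.3, 0.2, 0.2],
]
y = [1, 1, 1, 0, 0, 0]  # manual labels

w, b = fit_logistic(X, y)
p = predict_proba(w, b, [0.85, 0.4, 0.75])  # base models disagree
print(f"stacked probability: {p:.2f}")
```

In a real pipeline the meta-classifier should be fit on held-out predictions, for example via cross-validation, so that it does not simply learn the base models' training-set behavior.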

Topics

Large Language Models
Ensemble Learning
Systematic Literature Reviews
Biomedical Text Classification
EQ-5D
PubMed

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.