When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
Summary
A study evaluated six open-weight large language models (LLMs) on biomedical question answering (QA) using the HealthContradict corpus, asking how reliable the models remain when retrieved evidence conflicts, not only how accurate they are with helpful context. It tested five evidence conditions: no context, correct-only, incorrect-only, and two mixed conditions that present a correct and a contradictory document in opposite orders (correct-first conflicting, "CIC", and incorrect-first conflicting, "ICC"). Accuracy dropped for every model under conflicting evidence, and 11.4%–25.2% of predictions flipped when the document order was reversed. The study also introduced a conflict-aware abstention score (CAS), which combines model confidence with an evidence conflict detector. CAS improved selective accuracy by 7.2–33.4 points in the incorrect-only condition and by 3.6–14.4 points in the ICC condition, particularly at lower coverage levels.
Key takeaway
Machine learning engineers building biomedical QA systems should not evaluate LLMs on helpful context alone. Test systems under misleading and conflicting evidence as well, because these conditions significantly affect accuracy and calibration. Conflict-aware abstention mechanisms, such as the proposed conflict-aware abstention score (CAS), can improve reliability by letting models defer on uncertain cases, especially when the evidence is contradictory or predictions change with document order.
Key insights
Conflicting evidence and document order significantly degrade LLM accuracy and calibration in biomedical QA.
Principles
- Correct evidence improves LLM accuracy and calibration.
- Incorrect evidence actively harms LLM performance and calibration.
- Order of conflicting evidence impacts LLM predictions and confidence.
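The order effect in the last principle can be quantified as a flip rate: the share of questions whose predicted answer changes when the same two documents are shown in the opposite order. The sketch below is a minimal illustration, not the study's evaluation code; the function name `flip_rate` and the toy predictions are assumptions made up for this example.

```python
def flip_rate(preds_cic, preds_icc):
    """Fraction of questions whose prediction differs between the two document orders."""
    if len(preds_cic) != len(preds_icc):
        raise ValueError("both runs must cover the same questions")
    if not preds_cic:
        raise ValueError("at least one prediction is required")
    flips = sum(a != b for a, b in zip(preds_cic, preds_icc))
    return flips / len(preds_cic)


# Toy predictions, invented for illustration (not data from the study).
cic_predictions = ["yes", "no", "yes", "yes"]   # correct document shown first
icc_predictions = ["yes", "yes", "yes", "no"]   # incorrect document shown first

rate = flip_rate(cic_predictions, icc_predictions)
print(f"flip rate: {rate:.1%}")
```

A flip counts any change of answer, whether it moves toward or away from the correct label, so the flip rate measures instability and not accuracy.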
Method
A conflict-aware abstention score (CAS) combines model confidence with a logistic detector of confidently wrong predictions, trained on uncertainty signals and sentence embeddings, to improve selective accuracy in challenging conditions.
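The summary does not give the exact formula for CAS, so the sketch below only illustrates the general idea under stated assumptions: a logistic detector outputs the probability that a prediction is confidently wrong, and the abstention score discounts the model's confidence by that probability. The product form `confidence * (1 - p_wrong)`, the single toy feature, the hand-set weights, and all function names are hypothetical; in the study the detector is trained on uncertainty signals and sentence embeddings.

```python
import math


def detector_probability(features, weights, bias):
    """Logistic model: probability that a prediction is confidently wrong."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def conflict_aware_score(confidence, p_wrong):
    """Assumed combination rule: discount confidence by the detector output."""
    return confidence * (1.0 - p_wrong)


def selective_accuracy(scores, correct, coverage):
    """Accuracy on the top `coverage` fraction of predictions ranked by score."""
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Toy data, invented for illustration (not results from the study).
confidences = [0.95, 0.90, 0.60, 0.55]
correct = [False, True, True, False]
features = [[2.0], [-2.0], [-1.0], [1.0]]   # one hypothetical conflict signal each
weights, bias = [1.5], 0.0                  # hand-set here, learned in practice

p_wrong = [detector_probability(f, weights, bias) for f in features]
cas = [conflict_aware_score(c, p) for c, p in zip(confidences, p_wrong)]

baseline_acc = selective_accuracy(confidences, correct, coverage=0.5)
cas_acc = selective_accuracy(cas, correct, coverage=0.5)
print(f"confidence only: {baseline_acc:.2f}, conflict-aware: {cas_acc:.2f}")
```

In this toy case the most confident prediction is wrong, so ranking by confidence alone keeps it at 50% coverage, while the conflict-aware score pushes it to the bottom of the ranking and the system abstains on it.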
In practice
- Evaluate LLMs under misleading and mixed-evidence conditions.
- Implement conflict-aware abstention for biomedical QA systems.
- Consider reformulating conflicting documents into a synthesis.
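The first recommendation above can be turned into a small evaluation harness that builds all five evidence conditions for each question. The sketch below is one way to do this, assuming each question comes with one correct and one contradictory document; the helper names, the prompt template, and the yes/no answer format are assumptions for illustration and are not taken from the study.

```python
def build_conditions(correct_doc, incorrect_doc):
    """Return the document list for each of the five evidence conditions."""
    return {
        "no_context": [],
        "correct_only": [correct_doc],
        "incorrect_only": [incorrect_doc],
        "CIC": [correct_doc, incorrect_doc],   # correct document first
        "ICC": [incorrect_doc, correct_doc],   # incorrect document first
    }


def format_prompt(question, documents):
    """Hypothetical prompt template: numbered documents, then the question."""
    parts = [f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)]
    parts.append(f"Question: {question}")
    parts.append("Answer (yes/no):")
    return "\n".join(parts)


# Placeholder inputs; real questions and documents would come from the corpus.
conditions = build_conditions(
    correct_doc="<text of the supporting document>",
    incorrect_doc="<text of the contradicting document>",
)
prompts = {
    name: format_prompt("<biomedical question>", docs)
    for name, docs in conditions.items()
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Running the same model over all five prompts per question makes it possible to report accuracy per condition and to compare the CIC and ICC answers directly.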
Topics
- Retrieval-Augmented LLMs
- Biomedical Question Answering
- Conflicting Evidence
- Order Effects
- Uncertainty Estimation
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.