When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

A study evaluated six open-weight large language models (LLMs) on biomedical question answering (QA) using the HealthContradict corpus, asking how reliable the models remain when evidence conflicts rather than only how accurate they are with helpful context. It tested five evidence conditions: no context, correct-only, incorrect-only, and two mixed conditions that present a correct and a contradictory document in opposite orders, namely correct-first conflicting (CIC) and incorrect-first conflicting (ICC). Accuracy dropped for every model in the conflicting-evidence conditions, and 11.4%–25.2% of predictions flipped when the document order was reversed. The study also introduced a conflict-aware abstention score (CAS) that combines model confidence with an evidence conflict detector. CAS improved selective accuracy by 7.2–33.4 points in the incorrect-only condition and by 3.6–14.4 points in the incorrect-first conflicting condition, with the largest gains at lower coverage levels.
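As a minimal sketch of how the five evidence conditions could be assembled for a single question, assuming each item comes with one correct and one contradictory document. The function names, the prompt template, and the yes/no answer format are illustrative assumptions and are not taken from the study.

```python
from typing import Dict, List


def build_conditions(correct_doc: str, incorrect_doc: str) -> Dict[str, List[str]]:
    """Return the ordered document list for each of the five evidence conditions."""
    return {
        "no_context": [],
        "correct_only": [correct_doc],
        "incorrect_only": [incorrect_doc],
        "CIC": [correct_doc, incorrect_doc],  # correct document first
        "ICC": [incorrect_doc, correct_doc],  # incorrect document first
    }


def render_prompt(question: str, docs: List[str]) -> str:
    """Hypothetical prompt template (assumption: the study's template may differ)."""
    lines = [f"Document {i}: {doc}" for i, doc in enumerate(docs, start=1)]
    lines.append(f"Question: {question}")
    lines.append("Answer (yes/no):")  # assumed answer format
    return "\n".join(lines)
```

CIC and ICC contain the same two documents and differ only in order, so any difference in a model's answers between them can be attributed to position.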

Key takeaway

Machine learning engineers building biomedical QA systems should not evaluate LLMs on helpful context alone. Test systems under misleading and conflicting evidence as well, because these conditions significantly degrade both accuracy and calibration. Conflict-aware abstention mechanisms, such as the proposed conflict-aware abstention score (CAS), improve reliability by letting a model defer on uncertain cases, especially when the evidence is contradictory or the prediction changes with document order.

Key insights

Conflicting evidence and document order significantly degrade LLM accuracy and calibration in biomedical QA.

Principles

Correct evidence improves LLM accuracy and calibration.
Incorrect evidence actively harms LLM performance and calibration.
Order of conflicting evidence impacts LLM predictions and confidence.
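The order effect can be quantified as the fraction of questions whose prediction changes between the two mixed conditions. The sketch below is one straightforward way to compute such a flip rate. The function name and the toy predictions are assumptions for illustration, and the study's exact definition may differ.

```python
from typing import Sequence


def flip_rate(preds_cic: Sequence[str], preds_icc: Sequence[str]) -> float:
    """Fraction of questions whose answer differs between CIC and ICC orderings."""
    if len(preds_cic) != len(preds_icc):
        raise ValueError("prediction lists must be aligned question by question")
    if not preds_cic:
        raise ValueError("at least one prediction is required")
    flips = sum(a != b for a, b in zip(preds_cic, preds_icc))
    return flips / len(preds_cic)


# Toy example: two of four answers change when the document order is reversed.
toy_cic = ["yes", "no", "yes", "yes"]
toy_icc = ["yes", "yes", "yes", "no"]
toy_rate = flip_rate(toy_cic, toy_icc)  # 0.5
```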

Method

A conflict-aware abstention score (CAS) combines model confidence with a logistic detector of confidently wrong predictions, trained on uncertainty signals and sentence embeddings, to improve selective accuracy in challenging conditions.
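A rough sketch of the idea follows, using only the Python standard library. This summary does not state how CAS combines the two signals, so the product rule in `conflict_aware_score` is an assumption, as are the function names, the hand-rolled logistic regression, and the two-dimensional toy features standing in for uncertainty signals and sentence embeddings.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit logistic regression by batch gradient descent.

    Returns the weights with the bias stored as the last entry.
    Label 1 means "confidently wrong", label 0 means anything else.
    """
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err  # bias gradient
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Detector output: estimated probability of a confidently wrong prediction."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def conflict_aware_score(confidence: float, p_wrong: float) -> float:
    """Assumed combination rule: discount confidence by the detector's risk estimate."""
    return confidence * (1.0 - p_wrong)
```

Under this assumed rule, a prediction with high raw confidence still receives a low score when the detector flags it as likely to be confidently wrong, so it is among the first to be abstained on.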

In practice

Evaluate LLMs under misleading and mixed-evidence conditions.
Implement conflict-aware abstention for biomedical QA systems.
Consider reformulating conflicting documents into a synthesis.
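To evaluate an abstention mechanism, selective accuracy can be measured at a chosen coverage level: answer only the highest-scoring fraction of questions and compute accuracy on that subset. The following is a minimal sketch of that standard procedure. The function name and the toy scores are assumptions, and the scores could come from raw confidence or from a conflict-aware score.

```python
from typing import Sequence


def selective_accuracy(scores: Sequence[float], correct: Sequence[int],
                       coverage: float) -> float:
    """Accuracy on the top-`coverage` fraction of predictions ranked by score.

    `correct[i]` is 1 if prediction i is right and 0 otherwise.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and aligned")
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]  # the model abstains on everything else
    return sum(correct[i] for i in kept) / n_keep


# Toy example: answering only the top half removes the one wrong prediction.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_correct = [1, 1, 0, 1]
full = selective_accuracy(toy_scores, toy_correct, coverage=1.0)  # 0.75
half = selective_accuracy(toy_scores, toy_correct, coverage=0.5)  # 1.0
```

Comparing this curve for raw confidence against a conflict-aware score, separately for each evidence condition, shows where abstention actually helps.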

Topics

Retrieval-Augmented LLMs
Biomedical Question Answering
Conflicting Evidence
Order Effects
Uncertainty Estimation

Code references

YikunHan42/When_Evidence_Conflicts

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.