Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study
Summary
A systematic empirical study compared five retrieval strategies for Retrieval-Augmented Generation (RAG) in biomedical question answering. To isolate retrieval performance, the study held the rest of the pipeline fixed: GPT-4o-mini as the generation model, ChromaDB as the vector store, and OpenAI's text-embedding-3-small for embeddings. The strategies evaluated were Dense Vector Search, Hybrid BM25 + Dense, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR). Evaluation on 250 BioASQ question-answer pairs (rag-mini-bioasq) used four DeepEval metrics, each reported with 95% confidence intervals: contextual precision, contextual recall, faithfulness, and answer relevancy. Cross-Encoder Reranking achieved the highest composite score (0.827) and the highest contextual precision (0.852). All RAG conditions significantly outperformed a no-context baseline on answer relevancy (0.658 to 0.701 vs. 0.287), validating the utility of retrieval.
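The summary reports 95% confidence intervals but does not say how they were computed. As a minimal sketch, assuming a normal approximation over per-question metric scores (the function name `mean_ci95` and the sample scores are hypothetical, not taken from the study), an interval for one metric could be derived like this:

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) for a normal-approximation 95% interval.

    Assumption: scores are independent per-question metric values.
    The study's actual interval method is not stated in this summary.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-question scores for a single metric.
example_scores = [0.9, 0.8, 1.0, 0.7, 0.85, 0.95, 0.6, 0.9]
mean, lo, hi = mean_ci95(example_scores)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

With a sample of 250 questions the normal approximation is usually reasonable; a bootstrap would be an alternative when scores are heavily skewed or bounded near 0 or 1.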
Key takeaway
For AI Architects and Research Scientists designing RAG systems in high-stakes domains like biomedicine, prioritize Cross-Encoder Reranking. It achieved the highest contextual precision (0.852) and composite score (0.827) of the five strategies, which suggests it offers the most reliable grounding for LLM outputs. Avoid naive Multi-Query Expansion, as it can degrade precision. Weigh the diversity that MMR adds against its potential cost to answer relevancy, and always include a retrieval component: every RAG condition scored far higher on answer relevancy than the no-context baseline.
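To make the MMR trade-off mentioned above concrete, here is a minimal sketch of the standard greedy MMR selection rule. It is not the study's implementation: the helper names, the toy 2-D vectors, and the `lambda_mult` values are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already-selected documents.
    """
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.3)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

In this toy case a low `lambda_mult` swaps the near-duplicate for a less relevant but different document, which is the kind of diversity-versus-relevancy tension the takeaway describes.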
Key insights
Cross-Encoder Reranking delivered the highest contextual precision (0.852) and composite score (0.827) of the five strategies evaluated in the biomedical RAG pipeline.
Principles
- Query-document interaction improves retrieval.
- Naive query diversification can introduce noise.
- RAG dramatically improves answer relevancy.
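The first principle, that scoring the query and document together improves retrieval, is the idea behind two-stage reranking. The sketch below shows only the control flow: a first-stage retriever supplies candidates, and a pairwise scorer reorders them. The `token_overlap` function is a toy stand-in and is not a cross-encoder; in a real pipeline a trained cross-encoder model would take its place. All names and example passages here are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score each (query, document) pair jointly and keep the top_k."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, doc: str) -> float:
    """Toy pairwise scorer (Jaccard overlap). NOT a cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


# Hypothetical first-stage candidates, e.g. from dense vector search.
candidates = [
    "aspirin reduces fever",
    "metformin lowers blood glucose in type 2 diabetes",
    "insulin regulates blood glucose",
]
top = rerank("which drug lowers blood glucose", candidates, token_overlap, top_k=2)
for doc, score in top:
    print(f"{score:.3f}  {doc}")
```

Keeping the scorer as a plain callable means the first-stage retriever and the reranker can be swapped independently, which mirrors the controlled comparison the study describes.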
Method
The study systematically compared five retrieval strategies in a biomedical RAG pipeline using fixed components (GPT-4o-mini, ChromaDB, text-embedding-3-small) and DeepEval metrics on a BioASQ subset.
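The controlled design described above, where only the retrieval strategy varies, can be sketched as a small evaluation harness. This is an illustrative outline under stated assumptions, not the study's code: `evaluate`, the toy retrievers, the toy generator, and the toy metric are all hypothetical stand-ins for the real components (GPT-4o-mini, ChromaDB, and DeepEval metrics).

```python
from statistics import fmean
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]


def evaluate(
    retrievers: Dict[str, Retriever],
    questions: List[str],
    generate: Callable[[str, List[str]], str],
    metric: Callable[[str, List[str], str], float],
) -> Dict[str, float]:
    """Run every retriever through the same generator and metric."""
    results = {}
    for name, retrieve in retrievers.items():
        scores = []
        for question in questions:
            context = retrieve(question)
            answer = generate(question, context)  # generator is held fixed
            scores.append(metric(question, context, answer))
        results[name] = fmean(scores)
    return results


# Toy stand-ins so the harness runs end to end.
corpus = {"what lowers blood glucose": "metformin lowers blood glucose"}


def no_context(question: str) -> List[str]:
    return []  # baseline condition: no retrieved passages


def toy_lookup(question: str) -> List[str]:
    return [corpus[question]] if question in corpus else []


def toy_generate(question: str, context: List[str]) -> str:
    return context[0] if context else "unknown"


def toy_metric(question: str, context: List[str], answer: str) -> float:
    return 0.0 if answer == "unknown" else 1.0


results = evaluate(
    {"no_context": no_context, "toy_lookup": toy_lookup},
    list(corpus),
    toy_generate,
    toy_metric,
)
print(results)
```

Because the generator and metric are passed in once and shared across conditions, any difference in the per-strategy means can be attributed to retrieval alone.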
In practice
- Prioritize Cross-Encoder Reranking for precision.
- Be cautious with Multi-Query Expansion.
- Use RAG to ground LLM outputs.
Topics
- Retrieval-Augmented Generation
- Biomedical Question Answering
- Cross-Encoder Reranking
- Dense Vector Search
- Hybrid Retrieval
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.