Not your usual search system — feat. BM25, FAISS and XGBoost
Summary
An end-to-end hybrid search engine was built on the StackSample dataset of Stack Overflow posts (1.5 million questions and 2 million answers) and deployed on Google Cloud Run. The system runs a four-stage pipeline: BM25 lexical retrieval, FAISS dense retrieval, hybrid merging, and an XGBoost reranker. Raw HTML is first processed into clean, tokenized documents. Combining keyword and semantic search lets each method cover the other's weaknesses, and the modular architecture allows each stage to be scaled and upgraded independently. Evaluation with Recall@10, MRR@10, and nDCG@10 showed clear improvements: hybridization provided the largest gains, and the XGBoost reranker raised ranking quality by 31% over the BM25 baseline. The system is served through a FastAPI endpoint, with artifacts stored on Google Cloud Storage.
Key takeaway
For AI engineers building search or recommendation systems, a multi-stage pipeline of hybrid retrieval followed by reranking is worth adopting. Combining lexical (BM25) and semantic (FAISS) retrieval improves recall, and a learned reranker such as XGBoost then improves the ordering of the merged candidates. Keep the stages modular so each can be scaled and maintained independently, and evaluate with rank-aware metrics such as MRR and nDCG, which reflect where relevant results land in the list rather than only whether they were retrieved.
Key insights
Production-grade search systems combine lexical and semantic retrieval with learning-to-rank for optimal performance.
Principles
- Hybrid retrieval outperforms single-method approaches.
- Modular search pipelines enable independent scaling.
- Reranking improves result quality, not just recall.
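The difference between recall and ranking quality can be made concrete with the three metrics named in the summary. The following is a minimal sketch using binary relevance; the document ids and the two orderings are hypothetical, and the values are per-query scores that would be averaged over a query set in a real evaluation.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top k (0 if none)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: the same five candidates in two different orders.
relevant = {"a7", "a9"}
retrieved = ["a1", "a2", "a7", "a3", "a9"]  # order before reranking
reranked = ["a7", "a9", "a1", "a2", "a3"]   # order after reranking

for name, ranking in [("retrieved", retrieved), ("reranked", reranked)]:
    print(name,
          recall_at_k(ranking, relevant),
          round(mrr_at_k(ranking, relevant), 3),
          round(ndcg_at_k(ranking, relevant), 3))
```

Both orderings contain the same relevant documents, so Recall@10 is identical, while MRR@10 and nDCG@10 rise once the relevant documents move to the top.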
Method
A four-stage pipeline: BM25 for lexical retrieval, FAISS IVF+PQ for dense retrieval, hybrid merging of candidates, and XGBoost for learning-to-rank reranking, deployed via FastAPI on Cloud Run.
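The lexical and merging stages can be sketched in plain Python. This is an illustration under stated assumptions, not the article's implementation: the summary does not say how candidates are merged, so reciprocal rank fusion (RRF) is used here as one common choice, the `k1`, `b`, and `k` values are conventional defaults, and the dense ranking is a hypothetical stand-in for FAISS output.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k1=1.5, b=0.75):
    """Rank doc ids by Okapi BM25 score. `docs` maps doc id -> token list."""
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over each list a doc appears in."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

# Hypothetical toy corpus of already-tokenized questions.
docs = {
    "q1": ["merge", "two", "dicts", "python"],
    "q2": ["combine", "dictionaries", "python"],
    "q3": ["sort", "list", "python"],
}
query = ["merge", "dicts"]

lexical = bm25_rank(query, docs)   # keyword matches only
dense = ["q2", "q1", "q3"]         # assumed output of a dense retriever
merged = rrf_merge([lexical, dense])
print(lexical, merged)
```

BM25 finds only the exact keyword match, while the merged list also keeps the paraphrased question that the dense ranking contributed. A reranker would then reorder this merged candidate set.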
In practice
- Use BeautifulSoup for HTML cleaning in text data.
- Employ IVF+PQ for scalable dense vector indexing.
- Train rerankers with the "rank:ndcg" objective to optimize result ordering.
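As an illustration of the first item, HTML cleaning with BeautifulSoup might look like the sketch below. The `clean_html` and `tokenize` helpers, the regex tokenizer, and the sample post body are assumptions made for this example; the original article may clean and tokenize differently.

```python
import re
from bs4 import BeautifulSoup

def clean_html(raw_html):
    """Strip tags from a post body and collapse runs of whitespace."""
    soup = BeautifulSoup(raw_html, "html.parser")
    text = soup.get_text(separator=" ")
    return " ".join(text.split())

def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical post body in the style of a Stack Overflow question.
body = "<p>How do I <b>merge</b> two dicts?</p><pre><code>a.update(b)</code></pre>"
cleaned = clean_html(body)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

The resulting token lists are the kind of input a BM25 index expects, while the cleaned text would typically be passed to an embedding model before dense indexing.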
Topics
- Hybrid Search Architecture
- BM25 Lexical Retrieval
- FAISS IVF+PQ
- XGBoost Reranking
- Learning-to-Rank
Best for: AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.