Not your usual search system — feat. BM25, FAISS and XGBoost

2026-04-11 · Source: Data Science on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, medium

Summary

A full end-to-end hybrid search engine was built on the Stack Overflow StackSample dataset (1.5 million questions and 2 million answers) and deployed on Google Cloud Run. The system runs a four-stage pipeline: BM25 lexical retrieval, FAISS dense retrieval, hybrid merging, and an XGBoost reranker. Raw HTML is first cleaned and tokenized into documents, and keyword and semantic search are then combined so that each covers the other's limitations. The architecture is modular, so each stage can be scaled or upgraded independently. Evaluation with Recall@10, MRR@10, and nDCG@10 showed that hybridization provided the largest gains and that the XGBoost reranker further improved ranking quality by 31% over the BM25 baseline. The system is served through a FastAPI endpoint, with artifacts stored on Google Cloud Storage.
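The three metrics named above can be sketched in a few lines. This is a minimal illustration and not the article's evaluation code: the function names, the document ids, and the relevance grades are all hypothetical, and the nDCG variant shown uses linear gains (some implementations use 2^grade - 1 instead).

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, else 0.
    MRR@k is the mean of this value over all evaluation queries."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG with linear gains; `gains` maps doc id -> relevance grade."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical single query: two relevant answers, graded 2 and 1.
ranked = ["a7", "a2", "a9", "a4"]
gains = {"a2": 2.0, "a4": 1.0}
relevant = set(gains)

recall = recall_at_k(ranked, relevant)
rr = reciprocal_rank_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, gains)
```

Recall only checks whether the relevant documents made it into the top 10, while the reciprocal rank and nDCG also depend on where they landed, which is why a reranker can move the latter two without changing the candidate set.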

Key takeaway

For AI Engineers building search or recommendation systems, a multi-stage hybrid retrieval and reranking pipeline is worth the added complexity. Combining lexical retrieval (BM25) with dense retrieval (FAISS) improves recall, and a learned reranker such as XGBoost then improves the ordering of the merged results. Keep the stages modular for scalability and maintainability, and evaluate with rank-aware metrics such as MRR and nDCG, which reward placing relevant results near the top.

Key insights

Production-grade search systems combine lexical and semantic retrieval with learning-to-rank for optimal performance.

Principles

Hybrid retrieval outperforms single-method approaches.
Modular search pipelines enable independent scaling.
Reranking improves the ordering of results, not just recall.

Method

A four-stage pipeline: BM25 for lexical retrieval, FAISS IVF+PQ for dense retrieval, hybrid merging of candidates, and XGBoost for learning-to-rank reranking, deployed via FastAPI on Cloud Run.
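The summary does not say how the BM25 and FAISS candidate lists are merged. One common option is reciprocal rank fusion (RRF), shown here as an assumed stand-in rather than the article's method. The function name, the constant k=60, and the document ids are illustrative choices.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Merge ranked lists of doc ids; each list is ordered best first.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]

bm25_hits = ["q12", "q7", "q3"]    # hypothetical lexical results
dense_hits = ["q7", "q44", "q12"]  # hypothetical dense results
merged = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF works on ranks rather than raw scores, so it avoids having to normalize BM25 scores against vector distances. In a pipeline like the one described, the merged list would then be passed to the reranker as its candidate set.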

In practice

Use BeautifulSoup to strip HTML markup from raw text.
Employ IVF+PQ for scalable dense vector indexing.
Train rerankers with the "rank:ndcg" objective to optimize ordering.
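To make the scalability argument for IVF+PQ concrete, the arithmetic below compares raw float32 vector storage with product-quantized codes. The embedding dimension (384), the number of sub-quantizers (48), and indexing 2 million vectors are assumptions chosen for illustration, since the summary does not state the actual settings. The figures cover vector payload only and ignore ids, centroids, and codebooks.

```python
def flat_bytes(num_vectors, dim):
    """Storage for uncompressed float32 vectors (4 bytes per dimension)."""
    return num_vectors * dim * 4

def pq_code_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Storage for PQ codes: one code per sub-quantizer, rounded up to bytes."""
    bytes_per_vector = (num_subquantizers * bits_per_code + 7) // 8
    return num_vectors * bytes_per_vector

num_vectors = 2_000_000  # assumed: one vector per answer
dim = 384                # assumed embedding size
m = 48                   # assumed number of PQ sub-quantizers

# PQ splits each vector into m equal slices, so dim must divide evenly.
assert dim % m == 0

flat = flat_bytes(num_vectors, dim)
pq = pq_code_bytes(num_vectors, m)
ratio = flat / pq

print(f"flat: {flat / 1e9:.2f} GB, PQ codes: {pq / 1e6:.0f} MB, ratio: {ratio:.0f}x")
```

Under these assumptions the payload shrinks from about 3 GB to under 100 MB. The cost is approximate distances, which is one reason a reranking stage after retrieval is useful.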

Topics

Hybrid Search Architecture
BM25 Lexical Retrieval
FAISS IVF+PQ
XGBoost Reranking
Learning-to-Rank

Code references

PragatiBagul/stackoverflow-search

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.