Building a Semantic Search API: From Half a Million Documents to Millisecond Queries

2026-02-16 · Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, short

Summary

A semantic search API was built to retrieve results from approximately 500,000 articles in milliseconds, addressing the limitations of traditional keyword search. The system has two parts: an indexing pipeline and a retrieval server. The indexing pipeline loads a dataset from Hugging Face, generates 384-dimensional embeddings with the `all-MiniLM-L6-v2` Sentence Transformer model, and builds a FAISS `IndexFlatL2` index. The index and the original texts are saved to disk as `my_faiss.index` (~760MB) and `my_texts.pkl` (~2.5GB). The retrieval server, built with FastAPI, loads both files at startup and exposes a `/search` endpoint. The endpoint encodes each incoming query, runs a similarity search against the FAISS index, and maps the results back to the original documents, returning them in milliseconds.
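
The reported index size can be sanity-checked with simple arithmetic. The sketch below assumes a flat index stores every vector as uncompressed 32-bit floats and that file header overhead is negligible; the helper name `flat_index_bytes` is hypothetical.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Approximate storage for raw float32 vectors: one float per dimension."""
    return num_vectors * dim * bytes_per_float


size = flat_index_bytes(500_000, 384)
print(f"{size / 1e6:.0f} MB")  # 768 MB
```

With exactly 500,000 vectors this gives 768 MB, which is in line with the ~760MB file for "approximately 500,000" articles.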

Key takeaway

For AI engineers building retrieval systems over large document collections, this architecture is a practical blueprint. Pairing FAISS with Sentence Transformers delivers millisecond semantic search at the scale of roughly half a million documents. The same setup supplies the retrieval ("R") component of a future Retrieval-Augmented Generation (RAG) application, so that an LLM can be grounded in specific, relevant data.

Key insights

Semantic search systems can achieve millisecond query times over large datasets using FAISS and Sentence Transformers.

Principles

Embeddings capture semantic meaning.
FAISS makes similarity search over embeddings fast.
Separate indexing from retrieval.
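
To make the similarity-search principle concrete, here is a pure-Python illustration of what an exact, flat L2 index computes: the squared Euclidean distance from the query to every stored vector, keeping the k closest. This is not FAISS itself, only a toy stand-in; the function names and sample vectors are made up for illustration.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Exact search: score every stored vector, return the k closest as (distance, id)."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Three toy 2-dimensional "embeddings"; real ones here would have 384 dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
hits = flat_l2_search(vectors, [0.9, 0.1], k=2)
print([i for _, i in hits])  # [1, 0]
```

FAISS performs the same comparison with optimized native code, which is what keeps a scan of this kind fast at the scale described.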

Method

The method: load the text data, generate embeddings with Sentence Transformers, build a FAISS `IndexFlatL2` index, save the index and texts to disk, and serve queries through a FastAPI endpoint that encodes each query and searches the index.
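
The indexing half of this method might look roughly like the sketch below. It is an assumption-laden outline, not the original author's code: the function name `build_index`, the `batch_size` default, and the choice to pickle a plain list of strings are illustrative, and loading the Hugging Face dataset is left to the caller because the dataset is not named here.

```python
import pickle


def build_index(texts, index_path="my_faiss.index", texts_path="my_texts.pkl", batch_size=256):
    """Embed texts, build a flat L2 index, and save both artifacts to disk."""
    # Third-party packages (faiss-cpu, sentence-transformers), imported lazily.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        texts, batch_size=batch_size, show_progress_bar=True, convert_to_numpy=True
    ).astype("float32")  # FAISS expects float32 input

    index = faiss.IndexFlatL2(embeddings.shape[1])  # 384 dimensions for this model
    index.add(embeddings)

    faiss.write_index(index, index_path)
    with open(texts_path, "wb") as f:
        pickle.dump(list(texts), f)  # assumed layout: list of strings, in index order
    return index.ntotal
```

Keeping the texts in the same order as the vectors matters, because the index returns row positions rather than documents.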

In practice

Use `all-MiniLM-L6-v2` for embeddings.
Store the FAISS index and the texts in separate files.
Expose search through a FastAPI endpoint.
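
A retrieval server following these points could be sketched as below. Several details are assumptions rather than facts from the article: the `/search` route is shown as a GET with hypothetical `q` and `k` parameters, the response shape is invented, `my_texts.pkl` is assumed to hold a list of strings in index order, and `create_app` and `format_hits` are illustrative names.

```python
import pickle


def format_hits(distances, indices, texts):
    """Map one row of FAISS results back to the original documents."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:  # FAISS pads with -1 when fewer than k results exist
            continue
        hits.append({"text": texts[int(idx)], "distance": float(dist)})
    return hits


def create_app(index_path="my_faiss.index", texts_path="my_texts.pkl"):
    # Third-party packages, imported lazily so format_hits works without them.
    import faiss
    from fastapi import FastAPI
    from sentence_transformers import SentenceTransformer

    # Load the model, index, and texts once at startup, not per request.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    index = faiss.read_index(index_path)
    with open(texts_path, "rb") as f:
        texts = pickle.load(f)

    app = FastAPI()

    @app.get("/search")
    def search(q: str, k: int = 5):
        query_vec = model.encode([q], convert_to_numpy=True).astype("float32")
        distances, indices = index.search(query_vec, k)
        return {"query": q, "results": format_hits(distances[0], indices[0], texts)}

    return app
```

If this were saved as a hypothetical `server.py`, it could be started with `uvicorn server:create_app --factory`.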

Topics

Semantic Search
FAISS
Sentence Transformers
FastAPI
Retrieval-Augmented Generation

Code references

nazanin-hsz/rag-news

Best for: Machine Learning Engineer, AI Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.