Building a Semantic Search Engine for Patent Prior Art Discovery Using SBERT and FAISS
Summary
A capstone project developed an AI-based patent retrieval system for prior art discovery, comparing TF-IDF, BM25, and Sentence-BERT (SBERT) methods. The system converts patent abstracts into dense embeddings using SBERT and employs FAISS for scalable similarity search. Evaluation, using ranking-based metrics like Hit@K and Cumulative Match Characteristic, demonstrated that SBERT consistently outperformed lexical baselines (TF-IDF and BM25), particularly in ranking relevant patents higher. While SBERT improved semantic matching, the project identified limitations, such as difficulty with patents containing multiple concepts and the inherent noise in using patent citations as ground truth for relevance. A local Gradio demo was also built to facilitate interaction with the retrieval system.
Key takeaway
For AI engineers building patent search systems, pairing SBERT embeddings with a FAISS index can substantially improve prior art discovery by capturing conceptual similarity rather than keyword overlap alone. Look beyond model performance: robust evaluation, thorough failure analysis, and careful system design are needed to handle real-world complexities such as noisy ground truth and multi-concept patents. Hybrid retrieval and fine-tuning embeddings on domain-specific data are candidate directions for further gains.
Key insights
Semantic embeddings improve patent prior art discovery by retrieving documents based on meaning, not just keyword overlap.
Principles
- Semantic retrieval outperforms lexical methods for conceptual similarity.
- Early retrieval of relevant results is critical for practical search systems.
- Citation relationships offer a scalable, albeit noisy, relevance signal.
Method
Convert patent abstracts to SBERT embeddings, index them with FAISS, and run a cosine similarity search to rank candidates. Compare the resulting rankings against TF-IDF and BM25 baselines.
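The ranking step in the method above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the three-dimensional vectors are made-up stand-ins for SBERT embeddings, the patent IDs are hypothetical, and the brute-force loop stands in for a FAISS index. The one fact it relies on is that an inner product over L2-normalized vectors equals cosine similarity, which is why an inner-product index is commonly paired with normalized embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return it unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_search(query_vec, doc_vecs, top_k=3):
    """Rank documents by cosine similarity to the query (brute force).

    doc_vecs maps a document ID to its embedding. In a real system this
    loop would be replaced by a vector index; it is written out here only
    to show what the index computes.
    """
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        # Inner product of unit vectors == cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings (assumption: real SBERT vectors have
# hundreds of dimensions and come from encoding patent abstracts).
toy_embeddings = {
    "patent_A": [0.9, 0.1, 0.0],
    "patent_B": [0.1, 0.9, 0.2],
    "patent_C": [0.7, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]

results = cosine_search(query_embedding, toy_embeddings, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, patent_A ranks first and patent_C second, because their directions are closest to the query's. Swapping the loop for an approximate index trades some exactness for speed on large collections.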
In practice
- Use SBERT for conceptual patent search.
- Integrate FAISS for scalable embedding search.
- Evaluate with ranking metrics like Hit@K.
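As a rough illustration of the ranking metrics mentioned above, the sketch below computes Hit@K and a Cumulative Match Characteristic curve, using citation sets as the (noisy) relevance signal. It assumes the common reading in which Hit@K is the fraction of queries with at least one relevant document in the top K, and the CMC curve is that fraction plotted over K = 1..max_rank. The function names, query IDs, and patent IDs are all hypothetical, not taken from the original project.

```python
def first_relevant_rank(ranked_ids, relevant_ids):
    """Return the 1-based rank of the first relevant document, or None."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return rank
    return None


def hit_at_k(rankings, ground_truth, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for query_id, ranked_ids in rankings.items():
        rank = first_relevant_rank(ranked_ids, ground_truth[query_id])
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(rankings)


def cmc_curve(rankings, ground_truth, max_rank):
    """Hit@K for every K from 1 to max_rank (non-decreasing by construction)."""
    return [hit_at_k(rankings, ground_truth, k) for k in range(1, max_rank + 1)]


# Hypothetical retrieval output: query patent -> ranked candidate IDs.
rankings = {
    "q1": ["p3", "p7", "p1"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p5", "p6", "p8"],
}
# Hypothetical ground truth: query patent -> set of cited patent IDs.
ground_truth = {
    "q1": {"p7"},
    "q2": {"p2"},
    "q3": {"p0"},  # cited patent never retrieved, so q3 is a miss at every K
}

curve = cmc_curve(rankings, ground_truth, max_rank=3)
print([round(value, 3) for value in curve])
```

Running the same function over the rankings from each retriever (TF-IDF, BM25, SBERT) gives directly comparable curves; a method that places relevant patents earlier shows a higher curve at small K.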
Topics
- Patent Prior Art Discovery
- Semantic Search Engine
- Sentence-BERT
- FAISS Indexing
- Information Retrieval Metrics
Best for: Machine Learning Engineer, AI Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.