Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Summary
Apple augmented its App Store search ranker with textual relevance judgments generated by a Large Language Model (LLM). The ranker optimizes for both behavioral relevance (user clicks and downloads) and textual relevance (semantic fit between an app and the query). Because expert-provided textual relevance labels were scarce, the researchers fine-tuned a specialized LLM, which significantly outperformed a larger pre-trained model at generating high-quality labels. They then used this fine-tuned model to generate millions of textual relevance labels, addressing the scarcity of labeled data. Integrating these LLM-generated labels into the production ranker improved offline Normalized Discounted Cumulative Gain (NDCG) for both behavioral and textual relevance. A worldwide A/B test confirmed these gains, showing a statistically significant +0.24% increase in conversion rate, particularly benefiting tail queries, where behavioral data is sparse.
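For readers unfamiliar with the offline metric mentioned above, here is a minimal sketch of NDCG over a single ranked result list. The summary does not say which gain function or cutoff the authors used; the exponential gain (2^rel - 1) and the optional cutoff `k` below are common conventions chosen as assumptions for illustration.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain for graded relevance labels in ranked order.

    Assumption: exponential gain (2**rel - 1) with a log2 position discount.
    """
    top = relevances[:k] if k is not None else relevances
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(top))

def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and the score drops as relevant results are pushed further down the ranking.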
Key takeaway
NLP engineers optimizing search relevance in large-scale commercial systems should consider fine-tuning a specialized LLM to generate high-quality textual relevance labels. This approach can overcome label scarcity, improve both behavioral and textual relevance metrics, and lead to measurable conversion rate increases, especially for long-tail queries where traditional behavioral signals are weak.
Key insights
LLM-generated textual relevance labels significantly improve App Store search ranking and conversion rates.
Principles
- A fine-tuned, specialized LLM can outperform a larger general-purpose model on a narrow task such as relevance labeling.
- Augmenting sparse expert labels with LLM-generated labels can improve ranking performance.
Method
A specialized, fine-tuned LLM was used to generate millions of textual relevance labels, which were then integrated into a production search ranker to improve both behavioral and textual relevance.
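The summary does not describe how the LLM-generated labels were merged into the ranker's training data. As a rough sketch under stated assumptions, the snippet below attaches a textual relevance label to each (query, app) pair that has a behavioral signal, preferring an expert label when one exists and falling back to the LLM label otherwise. `TrainingRow`, `build_training_rows`, the key format, and the expert-first rule are all hypothetical, not Apple's pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (query, app_id); hypothetical key format

@dataclass
class TrainingRow:
    query: str
    app_id: str
    behavioral_label: float        # e.g. derived from clicks or downloads
    textual_label: Optional[int]   # graded textual relevance, None if unlabeled
    textual_source: Optional[str]  # "expert", "llm", or None

def build_training_rows(
    behavioral: Dict[Pair, float],
    expert_labels: Dict[Pair, int],
    llm_labels: Dict[Pair, int],
) -> List[TrainingRow]:
    """Attach a textual label to every pair that has a behavioral signal.

    Assumption: expert labels take precedence over LLM-generated ones.
    """
    rows = []
    for (query, app_id), behavioral_label in behavioral.items():
        key = (query, app_id)
        if key in expert_labels:
            label, source = expert_labels[key], "expert"
        elif key in llm_labels:
            label, source = llm_labels[key], "llm"
        else:
            label, source = None, None
        rows.append(TrainingRow(query, app_id, behavioral_label, label, source))
    return rows
```

Recording the label source makes it possible to weight or audit LLM-generated labels separately from expert ones later on.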
In practice
- Fine-tune LLMs for domain-specific label generation.
- Use LLM-generated data to address data scarcity in ranking systems.
- Prioritize tail queries for LLM-based relevance improvements.
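To act on the last point, a team first needs a working definition of "tail query". The source does not give one. The sketch below assumes a simple traffic-share rule: queries are ranked by volume, and everything after the first `head_share` fraction of cumulative traffic counts as tail. The function name and the 0.8 default are illustrative assumptions.

```python
def split_head_tail(query_counts, head_share=0.8):
    """Split queries into head and tail by cumulative traffic share.

    query_counts maps query string -> number of searches.
    Assumption: queries covering the first `head_share` of total traffic
    are "head"; all remaining queries are "tail".
    """
    total = sum(query_counts.values())
    ranked = sorted(query_counts.items(), key=lambda kv: kv[1], reverse=True)
    head, tail, cumulative = [], [], 0
    for query, count in ranked:
        if cumulative < head_share * total:
            head.append(query)
        else:
            tail.append(query)
        cumulative += count
    return head, tail
```

The tail list can then be used to decide which query-app pairs to send to the labeling model first, since those are the queries where behavioral signals are weakest.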
Topics
- LLM-Generated Labels
- Search Relevance
- App Store Ranking
- Textual Relevance
- A/B Testing
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.