Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

2026-02-27 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

Apple has augmented its App Store search ranker with relevance judgments generated by a Large Language Model (LLM). The ranker optimizes for two objectives: behavioral relevance (user clicks and downloads) and textual relevance (how well an app semantically fits the query). Because expert-provided textual relevance labels were scarce, researchers fine-tuned a specialized LLM, which significantly outperformed a larger pre-trained model at generating high-quality labels. The fine-tuned model was then used to generate millions of textual relevance labels, addressing the scarcity. Integrating these labels into the production ranker improved offline Normalized Discounted Cumulative Gain (NDCG) for both behavioral and textual relevance. A worldwide A/B test confirmed the gains, showing a statistically significant +0.24% increase in conversion rate, particularly on tail queries, where behavioral data is sparse.
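Since the offline results are reported as NDCG, a minimal sketch of the metric may help. It assumes the common formulation with linear gain and a log2(rank + 1) discount; the source summary does not say which variant was used, and the function names `dcg` and `ndcg` are illustrative only.

```python
import math
from typing import Optional, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains_in_ranked_order: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    Assumption: linear gain. Some systems use 2**gain - 1 instead.
    """
    if k is None:
        k = len(gains_in_ranked_order)
    actual = dcg(gains_in_ranked_order[:k])
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True)[:k])
    # A result list with no relevant items has no ideal ordering to compare to.
    return actual / ideal if ideal > 0 else 0.0
```

Here the gains are the relevance labels of the returned apps, listed in the order the ranker displayed them. A score of 1.0 means the ranker already placed the most relevant apps first.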

Key takeaway

NLP engineers optimizing search relevance in large-scale commercial systems should consider fine-tuning a specialized LLM to generate high-quality textual relevance labels. This approach can overcome label scarcity, improve both behavioral and textual relevance metrics, and produce measurable conversion rate gains, especially for long-tail queries where traditional behavioral signals are weak.

Key insights

LLM-generated textual relevance labels significantly improve App Store search ranking and conversion rates.

Principles

A fine-tuned, specialized LLM can outperform a larger general-purpose model on a narrow task such as relevance labeling.
Augmenting scarce expert labels with LLM-generated labels can improve ranking performance.

Method

A specialized, fine-tuned LLM was used to generate millions of textual relevance labels, which were then integrated into a production search ranker to improve both behavioral and textual relevance.
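The labeling step can be sketched roughly as follows. This is an illustration under stated assumptions, not Apple's pipeline: the summary does not describe the data flow at this level, and `LabeledPair`, `augment_labels`, and the `judge` callable (a stand-in for the fine-tuned LLM) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (query, app_id)


@dataclass(frozen=True)
class LabeledPair:
    query: str
    app_id: str
    relevance: int
    source: str  # "expert" or "llm"


def augment_labels(
    expert_labels: Dict[Pair, int],
    unlabeled: Iterable[Pair],
    judge: Callable[[str, str], int],
) -> List[LabeledPair]:
    """Keep every expert label, then fill in unlabeled pairs with judge()."""
    rows = [LabeledPair(q, a, r, "expert") for (q, a), r in expert_labels.items()]
    seen = set(expert_labels)
    for query, app_id in unlabeled:
        if (query, app_id) in seen:
            continue  # expert labels take precedence; skip duplicates
        seen.add((query, app_id))
        # Hypothetical: judge() would call the fine-tuned LLM in a real system.
        rows.append(LabeledPair(query, app_id, judge(query, app_id), "llm"))
    return rows
```

Recording the `source` of each label is a design choice of this sketch. It lets a training job weight or audit expert and LLM labels separately.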

In practice

Fine-tune LLMs for domain-specific label generation.
Use LLM-generated data to address data scarcity in ranking systems.
Prioritize tail queries for LLM-based relevance improvements.
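One way to act on the last point is to pick out rarely seen queries from a query log and send them for LLM labeling first. The sketch below assumes a plain list of query strings and a hypothetical frequency cutoff `max_count`; the source does not define how tail queries were identified.

```python
from collections import Counter
from typing import Iterable, List


def tail_queries(query_log: Iterable[str], max_count: int = 2) -> List[str]:
    """Return distinct normalized queries seen at most max_count times, rarest first."""
    counts = Counter(q.strip().lower() for q in query_log)
    tail = [q for q, c in counts.items() if c <= max_count]
    # Sort by frequency, then alphabetically, so the output is deterministic.
    return sorted(tail, key=lambda q: (counts[q], q))
```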

Topics

LLM-Generated Labels
Search Relevance
App Store Ranking
Textual Relevance
A/B Testing

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.