How Yelp Built “Yelp Assistant”

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

Yelp Assistant on business pages shows what it takes to turn a simple prototype into a production-ready AI assistant: a multi-stage, evidence-grounded system, with the engineering challenges and architectural trade-offs that come with it. The data strategy sets explicit freshness targets, stores structured and unstructured content separately, and replaces ad hoc batch dumps with streaming ingestion for high-velocity data. The inference pipeline moves from a monolithic LLM to specialized models for retrieval, content source selection, keyword generation, and input guardrails, with fine-tuned smaller models handling analysis and a larger model reserved for final generation. Streaming, parallelism, early stopping, and tiered models cut latency from over 10 seconds to under 3 seconds. A comprehensive evaluation system that uses LLM-as-a-judge to score distinct quality dimensions is also crucial.
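The parallelism and early stopping mentioned in the summary can be illustrated with a minimal sketch. It assumes the input guardrail and the retrieval step are independent async calls; the function names, the sleep-based latencies, and the keyword check are hypothetical stand-ins, not Yelp's implementation.

```python
import asyncio
from typing import List


async def check_guardrails(question: str) -> bool:
    # Hypothetical stand-in for a small, fast guardrail model.
    await asyncio.sleep(0.01)
    return "password" not in question.lower()


async def retrieve_evidence(question: str) -> List[str]:
    # Hypothetical stand-in for slower search and database calls.
    await asyncio.sleep(0.05)
    return [f"evidence for: {question}"]


async def handle(question: str) -> str:
    # Start both stages at once instead of running them back to back.
    guard_task = asyncio.create_task(check_guardrails(question))
    retrieval_task = asyncio.create_task(retrieve_evidence(question))

    if not await guard_task:
        # Early stop: the question was rejected, so abandon retrieval.
        retrieval_task.cancel()
        try:
            await retrieval_task
        except asyncio.CancelledError:
            pass
        return "REJECTED"

    evidence = await retrieval_task
    return f"ANSWER based on {len(evidence)} evidence item(s)"


if __name__ == "__main__":
    print(asyncio.run(handle("Is there parking?")))
```

In this arrangement an accepted question pays roughly the latency of the slower stage rather than the sum of both, and a rejected question returns as soon as the guardrail responds.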

Key takeaway

AI engineers moving a conversational assistant from prototype to production should replace the monolithic LLM with a modular, evidence-grounded RAG system. Prioritize a robust data strategy with differentiated freshness and storage, use specialized models for tasks such as content selection and keyword generation, and optimize serving with streaming and parallelism to keep latency and cost low under real-world traffic.

Key insights

Production AI assistants demand multi-stage RAG systems with specialized models, robust data pipelines, and efficient serving for accuracy and safety.

Principles

Decouple retrieval and generation phases.
Separate data by type for optimal freshness and reliability.
Modularize guardrails for comprehensive safety.
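One way to picture these principles is a pipeline in which guardrails, retrieval, and generation are separate, swappable callables. The sketch below is illustrative only: the `Evidence` type, the two example guardrails, and the blocked terms are assumptions, not details from the original article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evidence:
    source: str  # hypothetical label such as "reviews" or "business_info"
    text: str


# A guardrail returns a rejection reason, or None when the question passes.
Guardrail = Callable[[str], Optional[str]]
Retriever = Callable[[str], List[Evidence]]
Generator = Callable[[str, List[Evidence]], str]

MAX_QUESTION_CHARS = 500                      # assumed limit
BLOCKED_TERMS = ("password", "credit card")   # assumed examples


def too_long(question: str) -> Optional[str]:
    return "question too long" if len(question) > MAX_QUESTION_CHARS else None


def blocked_topic(question: str) -> Optional[str]:
    lowered = question.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "unsupported topic"
    return None


def answer(question: str, guardrails: List[Guardrail],
           retrieve: Retriever, generate: Generator) -> str:
    # Each guardrail is an independent check that can be added or removed.
    for check in guardrails:
        reason = check(question)
        if reason is not None:
            return f"Sorry, I can't help with that ({reason})."
    # Retrieval and generation only meet through the evidence list.
    evidence = retrieve(question)
    return generate(question, evidence)
```

Because `answer` depends only on the callable signatures, the retriever or the generator can be replaced (for example by a different model tier) without touching the other stages.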

Method

Implement Retrieval-Augmented Generation (RAG) with an offline indexing pipeline, real-time retrieval, and LLM-based generation with citation resolution. Employ tiered models for efficiency and fine-tune smaller models for question analysis.
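The indexing, retrieval, and citation-resolution steps can be sketched in a few functions. This is a toy version under stated assumptions: a keyword inverted index stands in for the real search backend, the prompt wording and `[n]` citation format are invented for illustration, and the LLM call itself is left out.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: List[Doc]) -> Dict[str, Set[str]]:
    # Offline step: map each token to the ids of the documents containing it.
    index: Dict[str, Set[str]] = {}
    for doc in docs:
        for token in set(tokenize(doc.text)):
            index.setdefault(token, set()).add(doc.doc_id)
    return index


def retrieve(question: str, index: Dict[str, Set[str]],
             docs_by_id: Dict[str, Doc], k: int = 2) -> List[Doc]:
    # Online step: rank documents by how many question tokens they share.
    scores: Dict[str, int] = {}
    for token in set(tokenize(question)):
        for doc_id in index.get(token, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [docs_by_id[d] for d in ranked[:k]]


def build_prompt(question: str, evidence: List[Doc]) -> str:
    # Number the evidence so the model can cite it as [1], [2], ...
    numbered = "\n".join(f"[{i}] {d.text}" for i, d in enumerate(evidence, 1))
    return ("Answer using only the evidence and cite it as [n].\n"
            f"{numbered}\nQuestion: {question}")


def resolve_citations(answer: str, evidence: List[Doc]) -> List[str]:
    # Map [n] markers back to document ids, ignoring out-of-range markers.
    cited: List[str] = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        position = int(marker)
        if 1 <= position <= len(evidence):
            doc_id = evidence[position - 1].doc_id
            if doc_id not in cited:
                cited.append(doc_id)
    return cited
```

Dropping out-of-range markers in `resolve_citations` is one simple way to keep an answer from pointing at evidence that was never retrieved.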

In practice

Use streaming ingestion for high-velocity data like reviews.
Store structured facts in databases, unstructured content in search indices.
Employ a Keyword Generator to enhance search query effectiveness.
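The storage split and the keyword step can be sketched as follows. Everything here is a hypothetical stand-in: a dictionary plays the database of structured facts, a list plays the search index, and simple stopword removal replaces the model-based Keyword Generator the article describes.

```python
import re
from typing import List, Optional

# Structured facts live in a key-value "database" (sample data is invented).
STRUCTURED_FACTS = {
    "hours": "Mon-Fri 8am-9pm",
    "parking": "Street parking only",
}

# Unstructured content lives in a "search index" (sample data is invented).
REVIEWS = [
    "Lovely patio, and they welcome dogs.",
    "Espresso was excellent.",
]

STOPWORDS = {"a", "an", "the", "is", "are", "what", "on", "for",
             "do", "does", "they", "have", "it", "of", "to", "in"}


def generate_keywords(question: str) -> List[str]:
    # Heuristic stand-in for a keyword model: drop stopwords, keep order.
    keywords: List[str] = []
    for token in re.findall(r"[a-z0-9]+", question.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def lookup_fact(keywords: List[str]) -> Optional[str]:
    # Exact lookup against structured facts.
    for keyword in keywords:
        if keyword in STRUCTURED_FACTS:
            return STRUCTURED_FACTS[keyword]
    return None


def search_reviews(keywords: List[str]) -> List[str]:
    # Keyword match against unstructured review text.
    return [r for r in REVIEWS if any(k in r.lower() for k in keywords)]
```

For example, `generate_keywords("What are the hours on weekends?")` yields `["hours", "weekends"]`, which hits the structured store directly, while a question about the patio falls through to review search.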

Topics

Retrieval-Augmented Generation
AI Assistant Systems
Production LLM Systems
Data Pipelines
Hybrid Retrieval

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.