How Yelp Built “Yelp Assistant”

· Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

Yelp Assistant, the AI assistant on Yelp business pages, shows what it takes to turn a simple prototype into a production-ready, multi-stage, evidence-grounded system. On the data side, the key shifts are explicit freshness targets, separate handling of structured and unstructured content, and a move from ad hoc batch dumps to streaming ingestion for high-velocity data. The inference pipeline moves from a single monolithic LLM to specialized models for retrieval, content source selection, keyword generation, and input guardrails, with fine-tuned smaller models handling question analysis and a larger model handling final generation. Streaming, parallelism, early stopping, and tiered models cut latency from over 10 seconds to under 3 seconds. A comprehensive evaluation system that uses LLM-as-a-judge to score distinct quality dimensions is also crucial.

Key takeaway

For AI engineers building conversational assistants, moving from prototype to production means replacing a monolithic LLM with a modular, evidence-grounded RAG system. Prioritize a robust data strategy in which freshness targets and storage differ by data type, use specialized models for tasks such as content selection and keyword generation, and optimize serving with streaming and parallelism to keep latency and cost low under real-world traffic.

Key insights

Production AI assistants demand multi-stage RAG systems with specialized models, robust data pipelines, and efficient serving for accuracy and safety.

Principles

Method

Implement Retrieval-Augmented Generation (RAG) with an offline indexing pipeline, real-time retrieval, and LLM-based generation with citation resolution. Employ tiered models for efficiency and fine-tune smaller models for question analysis.
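The three steps in this method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described in the article: the `Snippet` type, the token-overlap ranking, the `[n]` citation format, and the `generate` stub (which stands in for the LLM call) are all hypothetical, and the sample snippets are invented for the example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Snippet:
    source_id: str  # hypothetical identifier, e.g. "review:101"
    text: str

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(snippets: list[Snippet]) -> dict[str, set[int]]:
    """Offline step: map each token to the positions of snippets containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for pos, snippet in enumerate(snippets):
        for token in tokenize(snippet.text):
            index[token].add(pos)
    return index

def retrieve(query: str, snippets: list[Snippet],
             index: dict[str, set[int]], k: int = 2) -> list[Snippet]:
    """Online step: rank snippets by the number of distinct query tokens they share."""
    scores: dict[int, int] = defaultdict(int)
    for token in set(tokenize(query)):
        for pos in index.get(token, ()):
            scores[pos] += 1
    ranked = sorted(scores, key=lambda pos: (-scores[pos], pos))
    return [snippets[pos] for pos in ranked[:k]]

def generate(query: str, evidence: list[Snippet]) -> str:
    """Stand-in for the LLM call: repeats each evidence snippet with an [n] marker."""
    if not evidence:
        return "I could not find information about that."
    return " ".join(f"{s.text} [{i}]" for i, s in enumerate(evidence, start=1))

def resolve_citations(answer: str, evidence: list[Snippet]) -> dict[int, str]:
    """Map each valid [n] marker in the answer back to its source id."""
    resolved: dict[int, str] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(evidence):  # ignore markers with no matching evidence
            resolved[n] = evidence[n - 1].source_id
    return resolved

snippets = [
    Snippet("attributes:patio", "Outdoor patio seating is available."),
    Snippet("review:101", "The patio is heated in winter."),
    Snippet("review:102", "Parking is limited on weekends."),
]
index = build_index(snippets)
evidence = retrieve("Is there a heated patio?", snippets, index)
answer = generate("Is there a heated patio?", evidence)
citations = resolve_citations(answer, evidence)
```

Resolving citations after generation means that a marker the model invents, with no matching evidence, is dropped and never shown to the user as a source.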

In practice

Topics

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.