How Yelp Built “Yelp Assistant”
Summary
Yelp Assistant, the AI assistant on Yelp business pages, shows what it takes to move from a simple prototype to a production-ready, multi-stage, evidence-grounded system. On the data side, the key shift is a strategy with explicit freshness targets and separate storage for structured and unstructured content, replacing ad hoc batch dumps with streaming ingestion for high-velocity data. The inference pipeline moves from a single monolithic LLM to specialized components for retrieval, content source selection, keyword generation, and input guardrails, with fine-tuned smaller models handling question analysis and a larger model producing the final answer. Streaming, parallelism, early stopping, and tiered models cut latency from over 10 seconds to under 3 seconds. Evaluation relies on LLM-as-a-judge, with distinct quality dimensions scored separately.
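The parallelism and early-stopping idea can be sketched with Python's asyncio. This is a minimal illustration, not Yelp's implementation: the function names, the sleep calls standing in for small-model requests, and the rule-based logic are all assumptions made for the example.

```python
import asyncio
from typing import Optional


async def check_guardrail(question: str) -> bool:
    """Hypothetical input guardrail: True means the question may be answered."""
    await asyncio.sleep(0.01)  # stands in for a small-model call
    return "ignore previous instructions" not in question.lower()


async def select_sources(question: str) -> list:
    """Hypothetical content source selector."""
    await asyncio.sleep(0.05)  # stands in for a small-model call
    return ["reviews"] if "review" in question.lower() else ["business_facts"]


async def generate_keywords(question: str) -> list:
    """Hypothetical keyword generator: keeps words longer than three letters."""
    await asyncio.sleep(0.05)  # stands in for a small-model call
    return [w for w in question.lower().strip("?").split() if len(w) > 3]


async def analyze(question: str) -> Optional[dict]:
    # Start all analysis steps at once instead of running them in sequence.
    guard = asyncio.create_task(check_guardrail(question))
    others = [
        asyncio.create_task(select_sources(question)),
        asyncio.create_task(generate_keywords(question)),
    ]
    # Early stopping: if the guardrail rejects, cancel the remaining work.
    if not await guard:
        for task in others:
            task.cancel()
        await asyncio.gather(*others, return_exceptions=True)
        return None
    sources, keywords = await asyncio.gather(*others)
    return {"sources": sources, "keywords": keywords}
```

Calling `asyncio.run(analyze(question))` returns `None` for a rejected question and a dictionary of sources and keywords otherwise. Because the steps overlap, total analysis time is roughly that of the slowest step rather than the sum of all of them.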
Key takeaway
For AI Engineers building conversational assistants, transitioning from prototype to production demands a strategic shift from monolithic LLMs to modular, evidence-grounded RAG systems. You should prioritize a robust data strategy with differentiated freshness and storage, implement specialized models for tasks like content selection and keyword generation, and optimize serving with streaming and parallelism to ensure low latency and cost-efficiency under real-world traffic.
Key insights
Production AI assistants demand multi-stage RAG systems with specialized models, robust data pipelines, and efficient serving for accuracy and safety.
Principles
- Decouple retrieval and generation phases.
- Separate data by type for optimal freshness and reliability.
- Modularize guardrails for comprehensive safety.
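One way to read the modular guardrail principle is as a list of small, independent checks that can be added or removed without touching each other. The sketch below assumes that reading; the specific checks, thresholds, and marker strings are hypothetical and are not taken from the original article.

```python
from typing import Callable, List, Optional

# Each guardrail takes the question and returns a rejection reason, or None.
Guardrail = Callable[[str], Optional[str]]


def too_long(question: str) -> Optional[str]:
    """Hypothetical length limit."""
    return "question too long" if len(question) > 500 else None


def prompt_injection(question: str) -> Optional[str]:
    """Hypothetical keyword-based injection check."""
    markers = ("ignore previous instructions", "system prompt")
    hit = any(marker in question.lower() for marker in markers)
    return "possible prompt injection" if hit else None


GUARDRAILS: List[Guardrail] = [too_long, prompt_injection]


def run_guardrails(question: str, guardrails: List[Guardrail]) -> List[str]:
    """Run every check and collect the reasons from those that fire."""
    return [reason for g in guardrails if (reason := g(question)) is not None]
```

An empty result means the question passed every check. In a production system each check would more likely be a model call than a string match, but the composition stays the same.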
Method
Implement Retrieval-Augmented Generation (RAG) with an offline indexing pipeline, real-time retrieval, and LLM-based generation with citation resolution. Employ tiered models for efficiency and fine-tune smaller models for question analysis.
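The three stages named above (offline indexing, real-time retrieval, generation with citation resolution) can be sketched in a few lines. This is a toy illustration under stated assumptions: the in-memory inverted index, the `Doc` type, and the `[n]` citation marker format are all hypothetical, and the LLM call itself is left out.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Doc:
    doc_id: str
    text: str


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(docs: List[Doc]) -> Dict[str, Set[str]]:
    """Offline step: map each token to the ids of documents containing it."""
    index: Dict[str, Set[str]] = {}
    for doc in docs:
        for token in tokenize(doc.text):
            index.setdefault(token, set()).add(doc.doc_id)
    return index


def retrieve(question: str, index: Dict[str, Set[str]],
             docs_by_id: Dict[str, Doc], k: int = 2) -> List[Doc]:
    """Online step: rank documents by the number of shared tokens."""
    counts: Dict[str, int] = {}
    for token in tokenize(question):
        for doc_id in index.get(token, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    return [docs_by_id[d] for d in ranked[:k]]


def resolve_citations(answer: str, evidence: List[Doc]) -> Tuple[str, List[str]]:
    """Replace [n] markers with document ids; drop markers with no evidence."""
    cited: List[str] = []

    def replace(match: "re.Match") -> str:
        n = int(match.group(1))
        if 1 <= n <= len(evidence):
            doc_id = evidence[n - 1].doc_id
            if doc_id not in cited:
                cited.append(doc_id)
            return f"[{doc_id}]"
        return ""  # the marker points at nothing, so remove it

    return re.sub(r"\[(\d+)\]", replace, answer), cited
```

In use, the evidence returned by `retrieve` would be numbered in the prompt, the generation model would answer with `[n]` markers, and `resolve_citations` would map those markers back to stored documents. Keeping retrieval and generation as separate functions also means either one can be replaced or tested on its own.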
In practice
- Use streaming ingestion for high-velocity data like reviews.
- Store structured facts in databases, unstructured content in search indices.
- Employ a Keyword Generator to enhance search query effectiveness.
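The storage split and the keyword step can be illustrated together. The sketch below is an assumption-heavy toy: an in-memory SQLite table stands in for the structured store, a plain dictionary stands in for the search index, and a hand-written synonym table stands in for the Keyword Generator. Table names, fields, and sample data are all invented for the example.

```python
import sqlite3
from typing import List, Optional

# Structured facts go into a database (here: in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (business_id TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [("b1", "hours", "9am-5pm"), ("b1", "wifi", "yes")],
)

# Unstructured content goes into a search index (here: a plain dict).
REVIEWS = {
    "rv1": "Great espresso and friendly baristas.",
    "rv2": "The patio is dog friendly and sunny.",
}

# Hypothetical synonym table standing in for a keyword-generation model.
SYNONYMS = {"pets": ["dog", "dogs", "pet"], "coffee": ["espresso", "latte"]}


def lookup_fact(business_id: str, key: str) -> Optional[str]:
    """Exact lookup against the structured store."""
    row = conn.execute(
        "SELECT value FROM facts WHERE business_id = ? AND key = ?",
        (business_id, key),
    ).fetchone()
    return row[0] if row else None


def generate_keywords(question: str) -> List[str]:
    """Expand known question words into terms reviewers are likely to use."""
    keywords: List[str] = []
    for word in question.split():
        word = word.strip("?.,!").lower()
        if word in SYNONYMS:
            keywords.append(word)
            keywords.extend(SYNONYMS[word])
    return keywords


def search_reviews(keywords: List[str]) -> List[str]:
    """Return ids of reviews that contain at least one keyword."""
    wanted = set(keywords)
    hits = []
    for review_id, text in REVIEWS.items():
        tokens = set(text.lower().replace(".", "").split())
        if tokens & wanted:
            hits.append(review_id)
    return hits
```

A question such as "Are pets allowed?" shares no words with the review about a dog-friendly patio, so a literal search would miss it; the expanded keywords are what make the match. Questions about hours or wifi skip search entirely and go through `lookup_fact`.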
Topics
- Retrieval-Augmented Generation
- AI Assistant Systems
- Production LLM Systems
- Data Pipelines
- Hybrid Retrieval
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.