Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant
Summary
Spotify developed "Vedder," an AI data assistant, to meet the overwhelming demand for data insights across its 70,000+ datasets, with 1.4 trillion data points processed daily. Traditional LLM approaches failed for two reasons: context windows are too small to hold everything, and schemas alone do not convey critical business logic. Vedder has been in use since August 2025, serving over 2,100 Spotifiers across 13,000+ conversations. It operates on a "cluster model" in which domain experts curate "clusters" of data. Each cluster comprises relevant datasets with full schemas, vetted question-and-SQL example "pairs," and additional business "docs." This human-curated context is crucial for trustworthiness: in a trial, experts accepted only 12.5% of the question-SQL pairs generated automatically from query history, which shows how noisy raw query history is. Clusters are continuously monitored via health scores, which prompt experts to update context as the data evolves.
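To make the cluster model concrete, here is a minimal sketch of how a cluster could be represented. The class names, field names, and the `acceptance_rate` helper are all assumptions made for illustration; the source describes only what a cluster contains (datasets with schemas, vetted pairs, docs), not how Spotify stores it.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    schema: dict[str, str]  # column name -> column type


@dataclass
class Pair:
    question: str
    sql: str
    vetted: bool = False  # set to True only after a domain expert accepts it


@dataclass
class Cluster:
    # Hypothetical shape: one expert-owned bundle of context for a domain.
    name: str
    owner: str
    datasets: list[Dataset] = field(default_factory=list)
    pairs: list[Pair] = field(default_factory=list)
    docs: list[str] = field(default_factory=list)

    def vetted_pairs(self) -> list[Pair]:
        """Only expert-accepted pairs should reach the assistant's context."""
        return [p for p in self.pairs if p.vetted]


def acceptance_rate(candidates: list[Pair]) -> float:
    """Share of candidate pairs that an expert accepted."""
    if not candidates:
        return 0.0
    return sum(p.vetted for p in candidates) / len(candidates)


# Illustrative numbers only (not Spotify's trial data):
# eight auto-generated candidates, of which an expert accepts one.
candidates = [Pair(f"question {i}", f"SELECT {i}") for i in range(8)]
candidates[0].vetted = True

cluster = Cluster(
    name="listening",
    owner="expert@example.com",
    datasets=[Dataset("streams", {"country": "TEXT", "track": "TEXT"})],
    pairs=candidates,
    docs=["A 'stream' counts only after a minimum play time."],
)
rate = acceptance_rate(candidates)
```

Keeping the `vetted` flag on each pair, rather than deleting rejected candidates, is one design choice among several; it lets the acceptance rate be recomputed later.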
Key takeaway
If you are an AI architect or data scientist building an internal data assistant, raw schemas and query logs alone are neither sufficient nor trustworthy as LLM context. Empower domain experts to curate and own specific data "clusters" with vetted examples and business context. This approach supports both accuracy and scale: experts stop answering one-off questions and instead shape a reliable knowledge layer that serves thousands. Monitor context health continuously to prevent degradation and maintain trust.
Key insights
Human-curated context, not raw schemas, is essential for trustworthy AI data assistants at scale.
Principles
- Data experts must own context curation.
- Raw query history contains significant noise.
- Trustworthy AI requires human judgment.
Method
Spotify's data agent runs a ReAct loop: it selects context, writes SQL, runs the query, and returns an answer with its sources.
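The loop above can be sketched roughly as follows. This is a toy under stated assumptions, not Spotify's implementation: the `policy` function stands in for the LLM's reasoning step, context selection is plain word overlap, SQL "writing" simply reuses the closest vetted pair, and an in-memory SQLite database stands in for the warehouse. All names (`CLUSTERS`, `policy`, `answer`, and the tools) are hypothetical.

```python
import sqlite3

# Hypothetical curated clusters: datasets plus vetted (question, SQL) pairs.
CLUSTERS = {
    "listening": {
        "datasets": ["streams"],
        "pairs": [(
            "How many streams per country?",
            "SELECT country, COUNT(*) FROM streams "
            "GROUP BY country ORDER BY country",
        )],
    },
    "podcasts": {
        "datasets": ["episodes"],
        "pairs": [(
            "How many podcast episodes were published?",
            "SELECT COUNT(*) FROM episodes",
        )],
    },
}


def _overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))


def policy(question, obs):
    """Stand-in for the LLM: pick the next action from observations so far."""
    for needed, action in (("context", "select_context"),
                           ("sql", "write_sql"),
                           ("rows", "run_query")):
        if needed not in obs:
            return action
    return "finish"


def select_context(question, obs, conn):
    # Choose the cluster whose example questions best match the question.
    name, cluster = max(
        CLUSTERS.items(),
        key=lambda kv: max(_overlap(question, q) for q, _ in kv[1]["pairs"]),
    )
    obs["context"] = {"cluster": name, **cluster}


def write_sql(question, obs, conn):
    # A real agent would prompt an LLM with the context; here we reuse
    # the SQL of the closest vetted pair.
    _, sql = max(obs["context"]["pairs"], key=lambda p: _overlap(question, p[0]))
    obs["sql"] = sql


def run_query(question, obs, conn):
    obs["rows"] = conn.execute(obs["sql"]).fetchall()


TOOLS = {"select_context": select_context,
         "write_sql": write_sql,
         "run_query": run_query}


def answer(question, conn, max_steps=6):
    obs = {}
    for _ in range(max_steps):  # decide, act, observe, repeat
        action = policy(question, obs)
        if action == "finish":
            return {"rows": obs["rows"], "sql": obs["sql"],
                    "cluster": obs["context"]["cluster"],
                    "sources": obs["context"]["datasets"]}
        TOOLS[action](question, obs, conn)
    raise RuntimeError("step budget exhausted")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (country TEXT, track TEXT)")
conn.executemany("INSERT INTO streams VALUES (?, ?)",
                 [("SE", "a"), ("SE", "b"), ("US", "c")])
result = answer("How many streams per country?", conn)
```

Returning the SQL and the source datasets alongside the rows mirrors the "answers with sources" step, so a reader can check where a number came from.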
In practice
- Implement a "cluster model" for domain-specific context.
- Monitor cluster health scores to maintain context validity.
- Integrate user feedback to refine knowledge bases.
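The source does not say how cluster health scores are computed. As one hypothetical signal, the sketch below scores a cluster by the share of its vetted pairs whose SQL still compiles against the current schema, using SQLite's `EXPLAIN` to prepare each statement without running it. The function names and the 0.8 review threshold are assumptions.

```python
import sqlite3


def pair_still_valid(conn: sqlite3.Connection, sql: str) -> bool:
    """True if the SQL still compiles against the current schema."""
    try:
        conn.execute(f"EXPLAIN {sql}")  # prepares the statement only
        return True
    except sqlite3.OperationalError:  # e.g. "no such column"
        return False


def health_score(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (question, sql) pairs that are still valid."""
    if not pairs:
        return 0.0
    return sum(pair_still_valid(conn, sql) for _, sql in pairs) / len(pairs)


# Current schema: suppose the column `track` was renamed to `track_id`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (country TEXT, track_id TEXT)")

pairs = [
    ("Streams per country?",
     "SELECT country, COUNT(*) FROM streams GROUP BY country"),
    ("How many distinct tracks?",
     "SELECT COUNT(DISTINCT track) FROM streams"),  # stale after the rename
]

score = health_score(conn, pairs)
needs_review = score < 0.8  # assumed threshold for notifying the cluster owner
```

A check like this catches only schema drift; changes in business meaning would still need expert review and user feedback.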
Topics
- AI Data Assistant
- Context Curation
- LLM Applications
- Data Governance
- SQL Generation
- ReAct Framework
Best for: AI Product Manager, Product Manager, CTO, AI Engineer, Data Scientist, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Spotify Engineering.