How Spotify Taught an LLM to Think Like a Senior Data Analyst

2026-06-23 · Source: Artificial Intelligence on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, medium

Summary

Spotify built "Vedder," an internal AI data assistant, on the premise that LLMs fail at data questions because they lack data context, not intelligence. Launched in August 2025, Vedder has been adopted by over 2,100 employees, who have held 13,000+ conversations and exchanged 60,000+ messages across 177 data clusters covering domains such as advertising and finance. Simply dumping schemas into an LLM breaks down at Spotify's scale of 70,000+ datasets and 1.4 trillion daily data points, so Vedder adds a "Context Layer." This layer, built on a "Cluster Model," organizes data knowledge into expert-owned domains, each comprising curated datasets, vetted question-SQL examples, and tribal documentation. Crucially, Spotify found human curation of examples far more reliable than automation: only 12.5% of auto-generated pairs were accepted. Vedder also uses a ReAct loop for SQL generation and continuously monitored cluster health scores to keep the context accurate.

Key takeaway

If you are an AI engineer or ML architect building data-driven LLM applications, treat context engineering as more critical than prompt engineering. Shift your focus from making models "smarter" to making the context smarter, and make sure domain experts own its curation. Implement structured knowledge layers that segment data into expert-managed domains with continuously monitored health scores. Prioritize manual curation of examples over automation to avoid importing noise, so that your system returns reliable, trusted answers.

Key insights

LLMs fail from lack of meaningful context, not intelligence; expert-owned context is paramount.

Principles

Schemas alone do not convey business meaning.
Automating context collection introduces significant noise.
Context decays and requires continuous monitoring.

Method

Implement a "Context Layer" using a "Cluster Model" where domain experts curate datasets, question-SQL pairs, and documentation for specific data domains, then use a ReAct loop for query generation.
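
The method above can be sketched roughly as follows. This is a minimal illustration, not Spotify's implementation: the `Cluster` fields, `build_context`, `react_sql`, the `ad_revenue` table, and the scripted stand-in for the LLM are all hypothetical names chosen for the example, and SQLite stands in for the real warehouse.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One expert-owned data domain (all field names are hypothetical)."""
    name: str
    owner: str
    datasets: dict[str, str] = field(default_factory=dict)         # table -> description
    examples: list[tuple[str, str]] = field(default_factory=list)  # (question, SQL)
    docs: list[str] = field(default_factory=list)                  # tribal knowledge


def build_context(cluster: Cluster, question: str) -> str:
    """Assemble prompt context from curated cluster knowledge only."""
    lines = [f"Domain: {cluster.name} (owner: {cluster.owner})", "Datasets:"]
    lines += [f"- {table}: {desc}" for table, desc in cluster.datasets.items()]
    lines.append("Notes:")
    lines += [f"- {doc}" for doc in cluster.docs]
    lines.append("Examples:")
    for q, sql in cluster.examples:
        lines += [f"Q: {q}", f"SQL: {sql}"]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def react_sql(model, conn, context, max_steps=3):
    """ReAct-style loop: the model proposes SQL (action), we execute it,
    and any database error is appended as an observation for the next try."""
    transcript = context
    for _ in range(max_steps):
        sql = model(transcript)
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            transcript += f"\nAction: {sql}\nObservation: error: {exc}"
    raise RuntimeError("no valid query within the step budget")


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call: guesses a wrong column first,
    then corrects itself once it sees the error observation."""
    if "Observation: error" in transcript:
        return "SELECT SUM(amount_usd) FROM ad_revenue"
    return "SELECT SUM(revenue) FROM ad_revenue"


ads = Cluster(
    name="advertising",
    owner="ads-analytics",
    datasets={"ad_revenue": "Daily ad revenue; amount_usd is net of refunds"},
    examples=[("What is total ad revenue?", "SELECT SUM(amount_usd) FROM ad_revenue")],
    docs=["Revenue is always reported in USD."],
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_revenue (day TEXT, amount_usd REAL)")
conn.executemany(
    "INSERT INTO ad_revenue VALUES (?, ?)",
    [("2025-08-01", 10.0), ("2025-08-02", 5.0)],
)

context = build_context(ads, "How much ad revenue did we make?")
sql, rows = react_sql(scripted_model, conn, context)
print(sql, rows)
```

In this sketch the first proposed query fails on a nonexistent column, the error is fed back as an observation, and the second attempt succeeds. A real system would replace `scripted_model` with an actual model call and would likely add retrieval to pick the right cluster before building the context.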

In practice

Segment data knowledge into expert-owned domains (clusters).
Curate few-shot examples manually for higher quality.
Monitor context health scores to prevent decay.
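
The summary does not say how Spotify computes its cluster health scores, so the following is only one plausible shape for such a metric. The inputs (`ClusterStats`), the weights, and the 90-day review window are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ClusterStats:
    """Hypothetical signals an owner might track for one cluster."""
    examples_total: int
    examples_passing: int       # curated SQL examples that still execute
    datasets_total: int
    datasets_documented: int
    last_reviewed: date


def health_score(stats: ClusterStats, today: date, max_age_days: int = 90) -> float:
    """Return a score in [0, 1]; weights and thresholds are illustrative assumptions."""
    pass_rate = stats.examples_passing / stats.examples_total if stats.examples_total else 0.0
    coverage = stats.datasets_documented / stats.datasets_total if stats.datasets_total else 0.0
    age_days = (today - stats.last_reviewed).days
    # Freshness decays linearly to zero once the review is max_age_days old.
    freshness = max(0.0, 1.0 - age_days / max_age_days)
    return round(0.5 * pass_rate + 0.3 * coverage + 0.2 * freshness, 3)


recent = ClusterStats(20, 18, 10, 8, last_reviewed=date(2025, 11, 17))
stale = ClusterStats(20, 18, 10, 8, last_reviewed=date(2025, 1, 1))

today = date(2026, 1, 1)
recent_score = health_score(recent, today)
stale_score = health_score(stale, today)
print(recent_score, stale_score)
```

The point of the design is that a cluster with unchanged examples and documentation still loses score as its last review ages, which gives owners a signal to revisit context before it silently drifts.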

Topics

Context Engineering
Large Language Models
Data Assistants
SQL Generation
Knowledge Management
Data Curation
ReAct Agent

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence on Medium.