The Revenge of the Data Scientist

2026-03-25 · Source: Hamel Husain's Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

Hamel Husain's "The Revenge of the Data Scientist" argues that despite the rise of Large Language Models (LLMs) and foundation model APIs, the core skills of data science remain critical for shipping effective AI. The article, based on a PyAI Conf talk, contends that while model training may be less central, the essential work of setting up experiments, debugging stochastic systems, and designing robust metrics persists. Husain highlights five common pitfalls in LLM evaluations (evals): using generic metrics, unverified LLM judges, bad experimental design, poor data and labels, and over-automating human work. He emphasizes that these issues stem from neglecting fundamental data science practices, such as exploratory data analysis, model evaluation, and rigorous experimental design, asserting that "the harness is data science."

Key takeaway

For AI Engineers and Data Scientists building LLM-powered applications, you must re-emphasize fundamental data science practices in your evaluation pipelines. Avoid generic metrics and unverified LLM judges; instead, look at your raw data, design application-specific metrics, and rigorously validate your evaluation systems. This approach will help you diagnose failures, prioritize improvements, and build more reliable AI products.

Key insights

Core data science skills are essential for effective LLM evaluation and robust AI system development.

Principles

The "harness" of AI systems is largely data science.
Application-specific metrics are superior to generic ones.
Human judgment and data exploration are irreplaceable.

Method

To improve LLM evals, treat LLM judges as classifiers, verify them with human labels, and use precision/recall. Design application-specific metrics by exploring traces and categorizing failures.

In practice

Code a custom trace viewer for domain-specific data quirks.
Use human labels to validate LLM judges against a test set.
Base synthetic test data on real production logs and traces.

Topics

Data Science Roles
LLM Evaluation
Harness Engineering
Experimental Design
AI Metrics

Code references

hamelsmu/evals-skills

Best for: Data Scientist, Machine Learning Engineer, AI Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Hamel Husain's Blog.