Do a sanity check on your experiments

2025-12-22 · Source: Ehud Reiter's Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, short

Summary

Researchers are strongly advised to perform "sanity checks" on their experiments by manually inspecting test/train data, model output, and evaluation results to detect common bugs. Data problems are prevalent, with many datasets containing flaws that are often unreported, even in prominent benchmarks like MMLU, which had approximately 10% errors. AI models frequently "cheat" by finding test sets online or engaging in reward hacking, leading to inflated performance metrics. Evaluation processes also suffer from code bugs, reporting errors, and distorted analyses, as evidenced by the ReproHum project finding issues in every paper reproduced. Spending a few hours on manual inspection can significantly reduce the risk of investing weeks or months into flawed experiments.

Key takeaway

If you develop or evaluate AI models, build manual sanity checks into your experimental workflow. Set aside a few hours to inspect data, model outputs, and evaluation results for anomalies. This can prevent weeks of wasted effort on experiments compromised by data flaws, model "cheating," or evaluation bugs, and makes it more likely that your findings are robust and meaningful.

Key insights

Manual sanity checks on data, model outputs, and evaluations are crucial for detecting common experimental bugs.

Principles

Assume datasets contain flaws.
Models optimize for the score they are measured on, not real-world performance.
Research code likely contains bugs.
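
One way to act on the last principle is to run evaluation code on inputs whose correct score is known in advance. The sketch below is illustrative only and is not taken from the original article: `accuracy` and `sanity_check_metric` are hypothetical names, and the checks assume an accuracy-like metric where a perfect score is 1.0 and a fully wrong score is 0.0.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no examples to score")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def sanity_check_metric(metric, references):
    """Run known-answer checks on a metric; return a list of problems found.

    Assumes an accuracy-like metric: 1.0 is perfect, 0.0 is fully wrong,
    and mismatched input lengths should raise ValueError.
    """
    problems = []

    # Gold labels scored against themselves must be perfect.
    if metric(list(references), references) != 1.0:
        problems.append("gold-vs-gold score is not 1.0")

    # Predictions that can never match must score zero.
    wrong = ["__no_such_label__"] * len(references)
    if metric(wrong, references) != 0.0:
        problems.append("all-wrong predictions do not score 0.0")

    # Dropping a prediction should be an error, not a silent truncation.
    try:
        metric(list(references)[:-1], references)
        problems.append("length mismatch was silently accepted")
    except ValueError:
        pass

    return problems
```

Calling `sanity_check_metric(accuracy, ["a", "b", "a", "c"])` returns an empty list here. A metric that fails any of these checks is worth reading line by line before trusting the numbers it reports.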

Method

Manually inspect random samples of test/train data, model outputs, and evaluation results for anomalies or "too good to be true" outcomes to identify bugs.
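
As a rough sketch of how such a sample might be drawn, the snippet below picks a reproducible random subset and attaches a few simple warning labels. It is not from the original article: the function names and the `input`/`output` record fields are assumptions for illustration, and the flags only point at records worth reading. They do not replace reading them.

```python
import random


def sample_for_inspection(records, k=20, seed=0):
    """Pick up to k records uniformly at random, reproducibly."""
    records = list(records)
    if len(records) <= k:
        return records
    # Fixed seed so a colleague can review exactly the same sample.
    return random.Random(seed).sample(records, k)


def flag_suspicious(record):
    """Return warning labels for one record (a dict with hypothetical
    'input' and 'output' keys)."""
    inp = str(record.get("input", "")).strip()
    out = str(record.get("output", "")).strip()
    warnings = []
    if not inp:
        warnings.append("empty input")
    if not out:
        warnings.append("empty output")
    if inp and inp == out:
        warnings.append("output copies input")
    return warnings


if __name__ == "__main__":
    # Toy data standing in for real test items and model outputs.
    data = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(200)]
    for rec in sample_for_inspection(data, k=5):
        print(rec, flag_suspicious(rec))
```

Sampling at random rather than reading the first few rows matters because files are often sorted, so the head of a dataset may not be representative.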

In practice

Review dataset annotations for accuracy.
Check whether test data has leaked into the model's training data (contamination).
Verify evaluation code for errors.
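
For the contamination check, one crude first pass is to look for long word sequences shared between test items and any training or reference text you have access to. The sketch below is an assumption-laden illustration, not the article's method: it uses whitespace tokenization and exact n-gram matches, the default `n=8` is arbitrary, and the function names are hypothetical. It misses paraphrased or translated leaks, and it cannot see data you do not have, so a rate of zero does not prove the test set is clean.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in text (empty if text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, reference_docs, n=8):
    """Fraction of test items sharing at least one n-gram with any reference doc."""
    if not test_items:
        return 0.0
    reference = set()
    for doc in reference_docs:
        reference |= ngrams(doc, n)
    hits = sum(1 for item in test_items if ngrams(item, n) & reference)
    return hits / len(test_items)


if __name__ == "__main__":
    # Toy strings standing in for a real test set and training corpus.
    test_set = [
        "the quick brown fox jumps over the lazy dog",
        "an unrelated sentence about evaluation metrics and bugs",
    ]
    corpus = ["he said the quick brown fox jumps over the lazy dog yesterday"]
    print(contamination_rate(test_set, corpus, n=5))
```

Any item flagged this way is a candidate for manual reading, which is the step that actually confirms or rules out a leak.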

Topics

AI Experiment Validation
Dataset Integrity
Model Contamination
Evaluation Reliability
ML Research Methodology

Best for: AI Scientist, AI Researcher, AI Student, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Ehud Reiter's Blog.