A better way to describe three-way data splits

2021-04-04 · Source: Brian Spiering’s Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

The standard terminology for three-way data splits in machine learning often causes confusion among learners due to a lack of precision. This article proposes a hierarchical and sequential approach to describing data splitting, starting with an "all-dataset." This "all-dataset" is first divided into a "train-primary-dataset" and a "test-primary-dataset." Subsequently, the "train-primary-dataset" is further split into "train-current-dataset" and "validation-current-dataset," often through k-fold cross-validation. This refined terminology clarifies the distinct roles of each dataset: the "train-current-dataset" for parameter optimization, the "validation-current-dataset" for hyperparameter tuning, and the "test-primary-dataset" for evaluating model generalization. Ultimately, the best hyperparameters are used to retrain on the "all-dataset" for production deployment.

Key takeaway

For machine learning educators and practitioners designing model training pipelines, adopting the proposed hierarchical data split terminology can significantly reduce ambiguity. Your team should clearly define "all-dataset," "train-primary-dataset," "test-primary-dataset," "train-current-dataset," and "validation-current-dataset" to ensure consistent understanding of each dataset's role in parameter optimization, hyperparameter tuning, and generalization assessment. This clarity will streamline model development and deployment.

Key insights

Precise, hierarchical terminology for data splits improves clarity in machine learning education and practice.

Principles

Separate test data for generalization evaluation.
Hyperparameter search uses train/validation splits.

Method

Divide "all-dataset" into "train-primary-dataset" and "test-primary-dataset." Then, split "train-primary-dataset" into "train-current-dataset" and "validation-current-dataset" for iterative tuning.

In practice

Use "test-primary-dataset" for final generalization check.
Retrain on "all-dataset" with best hyperparameters for production.

Topics

Data Splitting
Machine Learning Terminology
Hyperparameter Tuning
Model Generalization
K-fold Cross-Validation

Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, Machine Learning Engineer, AI Student, Data Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Brian Spiering’s Newsletter.