Data filtering methods for training language models

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A comparative analysis evaluated Confident Learning and Dataset Cartography, two automatic label error detection methods, on three Russian text classification corpora: ru_emotion_e-culture (49,123 examples), RuCoLA (8,524 examples), and TERRa (2,337 examples). The researchers fine-tuned the rubert-base-cased model on each corpus to measure the effect of filtering. The results indicate that how well each method works depends strongly on the characteristics of the dataset. Filtering did not improve performance on large corpora with low noise levels, whereas Confident Learning achieved a significant F1-macro improvement on smaller, high-noise datasets. Dataset Cartography behaved more conservatively, removing fewer examples. Across all datasets, targeted removal by either method consistently outperformed random removal of an equal number of examples, which suggests that the flagged examples are genuinely problematic rather than arbitrary.

Key takeaway

Machine Learning Engineers training language models on smaller or potentially noisy datasets should consider adding automatic label error detection to their data preparation. Methods such as Confident Learning, which delivered significant F1-macro improvements on high-noise data, are preferable to simply removing examples at random. The filtering strategy should adapt to the dataset: filtering may bring no benefit on large, clean corpora, but it can improve generalization on smaller, noisier ones.

Key insights

Automatic label error detection can improve model performance, mainly on small, noisy datasets, and on every corpus tested it outperformed random removal of the same number of examples.

Principles

Data quality significantly impacts ML model generalization.
Label error detection method efficacy varies with dataset characteristics.
Targeted data filtering is superior to random data removal.
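
The comparison between targeted and random removal can be set up as a simple control. The sketch below is a minimal, dependency-free illustration and not the authors' code: the function names, the seed parameter, and the toy data are assumptions made for this example. It draws a random set of indices of the same size as the set flagged by a detection method, so that both filtered training sets have equal size.

```python
import random


def random_control_indices(n_examples, n_flagged, seed=0):
    """Pick as many random indices as the detection method flagged.

    Hypothetical helper: matching the removal count keeps the targeted
    and random training sets the same size, so only the choice of
    removed examples differs.
    """
    rng = random.Random(seed)  # fixed seed makes the control reproducible
    return set(rng.sample(range(n_examples), n_flagged))


def drop_indices(examples, removed):
    """Return the examples whose position is not in the removed set."""
    return [ex for i, ex in enumerate(examples) if i not in removed]


# Toy corpus of (text, label) pairs; contents are placeholders.
corpus = [(f"text {i}", i % 2) for i in range(10)]
flagged_by_method = {2, 5, 7}  # assumed output of a label error detector

targeted_train = drop_indices(corpus, flagged_by_method)
control = random_control_indices(len(corpus), len(flagged_by_method))
random_train = drop_indices(corpus, control)

print(len(targeted_train), len(random_train))  # 7 7
```

Fine-tuning the same model on both filtered sets and comparing F1-macro then isolates the effect of which examples were removed from the effect of how many were removed.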

Method

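Comparative analysis of Confident Learning and Dataset Cartography for label error detection in text classification: rubert-base-cased was fine-tuned on each corpus to assess the impact of filtering, with a control experiment that removed the same number of examples at random.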
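
To make the Confident Learning side of this comparison concrete, here is a simplified sketch of its core rule, written in plain Python. It is an illustration under stated assumptions rather than the study's implementation: the function name and toy data are hypothetical, the predicted probabilities are assumed to come from out-of-sample predictions (for example, cross-validation), and every class is assumed to have at least one labeled example. An example is flagged when the class the model is confident about differs from its given label.

```python
from statistics import mean


def confident_learning_flags(labels, pred_probs):
    """Flag likely label errors with a simplified Confident Learning rule.

    labels[i] is the given class index of example i; pred_probs[i] is the
    model's out-of-sample probability vector for that example.
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: average probability of class j over the
    # examples that carry label j (the class's mean self-confidence).
    thresholds = [
        mean(p[j] for y, p in zip(labels, pred_probs) if y == j)
        for j in range(n_classes)
    ]
    flags = []
    for y, p in zip(labels, pred_probs):
        # Classes whose probability clears their own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            flags.append(False)  # no confident class: leave the example alone
            continue
        best = max(confident, key=lambda j: p[j])
        flags.append(best != y)  # confident class disagrees with given label
    return flags


# Toy binary example: the third item is labeled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
print(confident_learning_flags(labels, pred_probs))
# [False, False, True, False, False, False]
```

Per-class thresholds are what let the rule tolerate classes on which the model is systematically less confident; a single global cutoff would over-flag those classes.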

In practice

Apply Confident Learning to small, high-noise text datasets.
Evaluate Dataset Cartography for conservative error filtering.
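
For the Dataset Cartography recommendation, the following sketch shows the training-dynamics statistics the method relies on: the mean probability assigned to the gold label across epochs (confidence) and its standard deviation (variability). The function names, the cutoff values, and the toy numbers are assumptions made for illustration; the source does not state which thresholds or selection rule the study used.

```python
from statistics import mean, pstdev


def cartography_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for each example.

    gold_probs_per_epoch[i] lists the probability the model assigned to
    example i's given label at the end of each training epoch.
    """
    return [(mean(p), pstdev(p)) for p in gold_probs_per_epoch]


def flag_hard_to_learn(gold_probs_per_epoch,
                       max_confidence=0.3, max_variability=0.2):
    """Flag examples the model never learns: low confidence, low variability.

    Both cutoffs are hypothetical defaults; in practice they would be
    tuned per dataset.
    """
    return [
        conf <= max_confidence and var <= max_variability
        for conf, var in cartography_stats(gold_probs_per_epoch)
    ]


# Toy dynamics over three epochs for three examples.
dynamics = [
    [0.60, 0.80, 0.90],  # learned steadily: easy
    [0.10, 0.15, 0.10],  # gold label stays improbable: possible label error
    [0.20, 0.90, 0.30],  # unstable: ambiguous rather than clearly wrong
]
print(flag_hard_to_learn(dynamics))  # [False, True, False]
```

Requiring both low confidence and low variability is one way to obtain conservative behavior: ambiguous examples with unstable predictions are kept, and only examples the model consistently rejects are removed.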

Topics

Data Filtering
Label Error Detection
Confident Learning
Dataset Cartography
Text Classification
rubert-base-cased

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.