Data filtering methods for training language models

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A comparative analysis evaluated two automatic label error detection methods, Confident Learning and Dataset Cartography, on three Russian text classification corpora: ru_emotion_e-culture (49,123 examples), RuCoLA (8,524 examples), and TERRa (2,337 examples). The researchers fine-tuned the rubert-base-cased model on each corpus to measure the effect of filtering. The results indicate that each method's effectiveness depends strongly on dataset characteristics: filtering did not improve performance on large corpora with low noise levels, whereas Confident Learning achieved a significant F1-macro improvement on smaller, high-noise datasets. Dataset Cartography behaved more conservatively, removing fewer examples. Across all datasets, targeted removal by either method outperformed random removal of the same number of examples, confirming that the methods flag meaningful examples rather than arbitrary ones.
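To make the Confident Learning idea concrete, here is a minimal pure-Python sketch of its core step: compute a per-class confidence threshold, then flag examples whose most confident class disagrees with the given label. It is a simplified illustration, not the code used in the study. The function name `find_label_issues` and the toy inputs are assumptions, and `pred_probs` is assumed to hold out-of-sample (for example, cross-validated) class probabilities.

```python
def find_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class.

    labels:     list of integer class ids (the possibly noisy given labels)
    pred_probs: list of per-class probability lists, assumed out-of-sample
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability reaches that class's threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            issues.append(i)  # confidently predicted class differs from the label
    return issues


# Toy data (hypothetical): example 2 is labeled 0 but predicted as class 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
flagged = find_label_issues(toy_labels, toy_probs)
```

Per-class thresholds let the procedure tolerate classes on which the model is systematically less confident, so an example is flagged only when another class clears its own bar.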

Key takeaway

Machine learning engineers who train language models, especially on smaller or potentially noisy datasets, should integrate automatic label error detection into data preparation. Prefer methods such as Confident Learning, which demonstrated significant F1-macro improvements on high-noise data, over simple random removal. Adapt the data preparation strategy to the dataset's characteristics: filtering may bring no benefit on large, clean corpora, but it can be crucial for improving generalization on smaller, noisier ones.

Key insights

Automatic label error detection improves model performance mainly on small, noisy datasets, and targeted removal outperforms random removal of the same number of examples.
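The random-removal comparison can be sketched as a size-matched control: drop exactly as many examples at random as the detector flagged, then train on both filtered sets and compare. The helper names below are hypothetical, and the sketch covers only index selection, not model training.

```python
import random


def targeted_removal(n_examples, flagged):
    """Indices kept after dropping the examples flagged by a detector."""
    flagged = set(flagged)
    return [i for i in range(n_examples) if i not in flagged]


def random_removal_control(n_examples, n_flagged, seed=0):
    """Indices kept after dropping n_flagged examples chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    removed = set(rng.sample(range(n_examples), n_flagged))
    return [i for i in range(n_examples) if i not in removed]


# Hypothetical example: a detector flagged 3 of 10 examples.
kept_targeted = targeted_removal(10, [1, 4, 7])
kept_random = random_removal_control(10, n_flagged=3, seed=42)
```

Matching the number of removed examples separates the effect of which examples are removed from the effect of simply training on less data.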

Method

The study compared Confident Learning and Dataset Cartography for label error detection in text classification, fine-tuning rubert-base-cased on each filtered corpus and using random-removal control experiments as a baseline.

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.