Data filtering methods for training language models
Summary
A comparative analysis evaluated Confident Learning and Dataset Cartography, two automatic label error detection methods, on three Russian text classification corpora: ru_emotion_e-culture (49,123 examples), RuCoLA (8,524 examples), and TERRa (2,337 examples). The researchers fine-tuned the rubert-base-cased model on each corpus to assess the impact of filtering. The results indicate that the effectiveness of each method depends strongly on dataset characteristics. On large corpora with low noise levels, filtering did not improve performance. On smaller, high-noise datasets, Confident Learning achieved a significant F1-macro improvement. Dataset Cartography behaved more conservatively, removing fewer examples. Across all datasets, targeted error removal by both methods consistently outperformed random removal of an equivalent number of examples, which suggests that the examples they flag are not arbitrary.
Key takeaway
Machine learning engineers who train language models on small or potentially noisy datasets should consider adding automatic label error detection to their data preparation. Confident Learning produced a significant F1-macro improvement on smaller, high-noise data, and both methods outperformed random removal of the same number of examples. Adapt the strategy to the dataset: filtering did not help on large, low-noise corpora, so it is most worthwhile where label noise is likely.
Key insights
Label error detection can improve model performance on small, noisy datasets and consistently outperforms random removal of an equivalent number of examples; on large, low-noise corpora it brought no gain.
Principles
- Data quality significantly impacts ML model generalization.
- Label error detection method efficacy varies with dataset characteristics.
- Targeted data filtering is superior to random data removal.
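The last principle rests on a control experiment: remove the same number of examples at random and compare against targeted removal. A minimal sketch of how the two training subsets might be built is given here; the function names, the seed, and the toy indices are assumptions for illustration, not details from the study.

```python
import random


def targeted_subset(n_examples, flagged):
    """Indices kept after dropping the examples a detector flagged."""
    flagged = set(flagged)
    return [i for i in range(n_examples) if i not in flagged]


def random_subset(n_examples, n_remove, seed=0):
    """Indices kept after dropping n_remove examples chosen uniformly at random.

    The seed is an illustrative default so the control run is reproducible.
    """
    rng = random.Random(seed)
    removed = set(rng.sample(range(n_examples), n_remove))
    return [i for i in range(n_examples) if i not in removed]


# Toy example: 10 training examples, a detector flagged three of them.
flagged_by_detector = [2, 5, 7]
kept_targeted = targeted_subset(10, flagged_by_detector)
kept_random = random_subset(10, n_remove=len(flagged_by_detector))
```

Both subsets have the same size, so any difference in the downstream metric can be attributed to which examples were removed rather than to how many.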
Method
The study compared Confident Learning and Dataset Cartography for label error detection in text classification, fine-tuning rubert-base-cased on each filtered corpus and using random removal of an equal number of examples as a control.
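To make the first method concrete, the core idea of Confident Learning can be sketched in a few lines. This is a simplified illustration, not the study's implementation and not the API of any library: it assumes out-of-sample predicted probabilities are already available (for example from cross-validation), and the function names are hypothetical. An example is flagged when the model is confidently predicting a class other than its given label, where "confidently" means at or above that class's average self-confidence.

```python
from collections import defaultdict


def class_thresholds(labels, pred_probs):
    """Per-class threshold: mean probability of class j over examples labeled j."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    return {j: sums[j] / counts[j] for j in counts}


def find_suspect_labels(labels, pred_probs):
    """Return indices whose most confident above-threshold class differs from the given label."""
    thresholds = class_thresholds(labels, pred_probs)
    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes for which this example clears the class-specific threshold.
        confident = [j for j, t in thresholds.items() if probs[j] >= t]
        if not confident:
            continue  # the model is not confident about any class; leave the example alone
        best = max(confident, key=lambda j: probs[j])
        if best != y:
            suspects.append(i)
    return suspects
```

Using per-class thresholds rather than a single global cutoff keeps the rule from over-flagging classes on which the model is systematically less confident.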
In practice
- Apply Confident Learning to small, high-noise text datasets.
- Evaluate Dataset Cartography for conservative error filtering.
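For the second recommendation, Dataset Cartography characterizes each training example by the probability the model assigns to its gold label across epochs: the mean is its confidence and the standard deviation is its variability. Examples with low confidence and low variability are the usual candidates for label errors. The sketch below assumes those per-epoch probabilities have already been logged; the function names and the cutoffs 0.3 and 0.1 are illustrative assumptions, not values reported in the source.

```python
from statistics import mean, pstdev


def cartography_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for each example.

    gold_probs_per_epoch[i] holds the probability assigned to example i's
    gold label at the end of each training epoch.
    """
    return [(mean(p), pstdev(p)) for p in gold_probs_per_epoch]


def hard_to_learn(gold_probs_per_epoch, max_confidence=0.3, max_variability=0.1):
    """Indices of examples that stay consistently low-confidence during training.

    Both cutoffs are hypothetical defaults and should be tuned per dataset.
    """
    stats = cartography_stats(gold_probs_per_epoch)
    return [
        i
        for i, (confidence, variability) in enumerate(stats)
        if confidence <= max_confidence and variability <= max_variability
    ]
```

Requiring both conditions is what makes the filter conservative: an example that the model eventually learns, or one whose predictions fluctuate, is kept.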
Topics
- Data Filtering
- Label Error Detection
- Confident Learning
- Dataset Cartography
- Text Classification
- rubert-base-cased
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.