An Adaptive Data-Cleaning Framework for Noisy Label Detection
Summary
A new self-adaptive data-cleaning framework addresses noisy labels in deep neural network training, where over-parameterized models tend to memorize corrupted data and lose accuracy. Unlike existing strategies that depend on manual thresholds or a single metric, the framework integrates local, global, and learning-dynamics cues. It maps samples into a unified low-dimensional feature space using a modular feature-concatenation paradigm. The framework offers two instantiations: a 2D metric combining class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Multi-metric clustering on this feature space then adaptively partitions samples into clean-dominant and noise-dominant components without manual thresholds or noise priors. In experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise, the framework achieved high recall, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent model training consistently yielded accuracy gains, particularly under severe corruption.
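The evaluation setup described above (symmetric label noise at a fixed rate, scored by recall) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' protocol: the function names `inject_symmetric_noise` and `detection_recall` are hypothetical, and it assumes the variant of symmetric noise in which a corrupted label is always replaced by a different class chosen uniformly at random.

```python
import numpy as np


def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fixed fraction of labels to a different class, chosen uniformly.

    Assumes integer labels in [0, num_classes). Returns the corrupted labels
    and a boolean mask marking which samples were flipped.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = labels.shape[0]
    n_noisy = int(round(noise_rate * n))
    flip_idx = rng.choice(n, size=n_noisy, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes always lands
    # on a different class, and every other class is equally likely.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy = labels.copy()
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    is_noisy = np.zeros(n, dtype=bool)
    is_noisy[flip_idx] = True
    return noisy, is_noisy


def detection_recall(flagged, is_noisy):
    """Share of truly corrupted samples that the detector flagged."""
    flagged = np.asarray(flagged, dtype=bool)
    is_noisy = np.asarray(is_noisy, dtype=bool)
    total = is_noisy.sum()
    return float((flagged & is_noisy).sum() / total) if total else float("nan")
```

Under this definition, a recall of 0.98 at 40% noise would mean that 98 out of every 100 corrupted samples ended up in the flagged set; it says nothing by itself about how many clean samples were flagged alongside them.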
Key takeaway
Machine learning engineers training deep neural networks on real-world datasets, where label noise is common, should consider integrating this self-adaptive data-cleaning framework. It offers a threshold-free way to detect noisy labels by combining multiple metrics, and the reported experiments show accuracy gains after cleaning, especially under heavy corruption. This can streamline data preparation and reduce the need for manual tuning or prior knowledge of the noise ratio.
Key insights
Multi-metric clustering on a unified feature space adaptively detects noisy labels without manual thresholds or noise priors.
Principles
- Integrate local, global, and learning dynamics cues.
- Map samples to a unified low-dimensional feature space.
- Multi-metric clustering outperforms single-scalar GMMs.
Method
Map samples to a low-dimensional feature space via modular feature concatenation. Apply multi-metric clustering to that space (e.g., the 2D variant with KNN local disagreement and k-means centroid distance, or the 3D variant that adds a z-normalized score) to adaptively partition samples into clean-dominant and noise-dominant components.
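A rough sketch of the 2D variant, using scikit-learn, may help make the pipeline concrete. Several choices here are assumptions rather than details from the paper: a fixed `k` stands in for the class-adaptive KNN, k-means is seeded with per-class means so that centroid `c` tracks class `c`, a two-component Gaussian mixture serves as the clustering step, and the component with the higher mean disagreement is treated as noise-dominant. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors


def local_disagreement(features, labels, k=10):
    """Fraction of each sample's k nearest neighbours carrying a different label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)
    # Column 0 is the query point itself (assuming no exact duplicates), so skip it.
    return (labels[idx[:, 1:]] != labels[:, None]).mean(axis=1)


def centroid_distance(features, labels, num_classes, seed=0):
    """Distance from each sample to the k-means centroid tied to its given label."""
    # Assumption: seed k-means with per-class means so centroid c tracks class c.
    # Requires integer labels in [0, num_classes) and at least one sample per class.
    init = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
    km = KMeans(n_clusters=num_classes, init=init, n_init=1, random_state=seed)
    dists = km.fit(features).transform(features)  # shape (n_samples, num_classes)
    return dists[np.arange(len(labels)), labels]


def detect_noisy_labels(features, labels, num_classes, k=10, seed=0):
    """Return a boolean mask that is True for samples in the noise-dominant cluster."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Concatenate the local and global metrics into one 2D point per sample.
    metrics = np.column_stack([
        local_disagreement(features, labels, k),
        centroid_distance(features, labels, num_classes, seed),
    ])
    # Standardise each metric so neither dominates the clustering.
    metrics = (metrics - metrics.mean(axis=0)) / (metrics.std(axis=0) + 1e-12)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(metrics)
    component = gmm.predict(metrics)
    # Treat the component with the higher mean local disagreement as noise-dominant.
    noisy_component = int(np.argmax(gmm.means_[:, 0]))
    return component == noisy_component
```

Here `features` would typically be embeddings from a pretrained or partially trained network. No threshold appears anywhere in the sketch: the split between clean and noisy samples comes from the mixture fit itself, which is the property the framework emphasizes. A 3D variant would append one more standardized column to `metrics` before clustering.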
In practice
- Improve DNN accuracy under severe label corruption.
- Reduce need for manual threshold tuning.
- Apply to diverse datasets like CIFAR-10, MNIST, ImageNet-100.
Topics
- Noisy Label Detection
- Data Cleaning
- Deep Neural Networks
- Multi-metric Clustering
- Computer Vision
- Feature Engineering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.