An Adaptive Data Cleaning Framework for Noisy Label Detection

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new self-adaptive data-cleaning framework targets noisy labels in deep neural network training, where over-parameterized models often memorize corrupted labels and lose accuracy as a result. Unlike existing strategies that depend on manual thresholds or a single metric, the framework integrates local, global, and learning-dynamics cues. It maps samples into a unified low-dimensional feature space using a modular feature-concatenation paradigm. Two instantiations are offered: a 2D variant that combines class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D variant that additionally incorporates a z-normalized score. Multi-metric clustering on this feature space then partitions samples adaptively into clean-dominant and noise-dominant components, without manual thresholds or noise priors. In experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise, the framework achieved high recall, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Models trained on the cleaned data consistently gained accuracy, particularly under severe corruption.

Key takeaway

Machine learning engineers who train deep neural networks on real-world datasets, where label noise is common, should consider integrating this self-adaptive data-cleaning framework. It offers a threshold-free way to detect noisy labels by combining multiple metrics, and in the reported experiments the cleaned data improved model accuracy, especially under high corruption. This can streamline data preparation and reduce the need for manual tuning or prior knowledge of the noise ratio.

Key insights

Multi-metric clustering on a unified feature space adaptively detects noisy labels without manual thresholds or noise priors.

Principles

Integrate local, global, and learning dynamics cues.
Map samples to a unified low-dimensional feature space.
Multi-metric clustering outperforms single-scalar GMMs.
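
The idea of concatenating separate cues into one feature space can be illustrated with a minimal sketch. The helper names z_normalize and concat_metrics are hypothetical, and the summary does not say which quantity the z-normalized score is computed from, so the third column is left as a generic per-sample score.

```python
import numpy as np

def z_normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + 1e-12)

def concat_metrics(*columns):
    """Stack per-sample metric vectors into an (n_samples, n_metrics) matrix.

    Hypothetical helper: each argument holds one value per sample, so adding
    a cue to the feature space means passing one more vector.
    """
    cols = [np.asarray(c, dtype=float).ravel() for c in columns]
    if len({c.shape[0] for c in cols}) != 1:
        raise ValueError("every metric needs exactly one value per sample")
    return np.column_stack(cols)

# Illustrative values only; these are not results from the paper.
local_cue = [0.1, 0.2, 0.9, 0.1]    # e.g. a local disagreement score
global_cue = [1.0, 1.2, 4.0, 0.9]   # e.g. a distance to a centroid
raw_score = [0.3, 0.4, 2.5, 0.2]    # assumed generic per-sample score

features_2d = concat_metrics(local_cue, global_cue)
features_3d = concat_metrics(local_cue, global_cue, z_normalize(raw_score))
```

Keeping each cue as its own column is what makes the scheme modular in this sketch: the clustering step downstream only sees a matrix and does not need to know how many metrics produced it.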

Method

Map each sample to a low-dimensional feature vector by concatenating per-sample metrics: KNN-based local disagreement and k-means-based centroid distance in the 2D variant, plus a z-normalized score in the 3D variant. Then cluster this feature space to partition samples adaptively into clean-dominant and noise-dominant components.
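
A rough sketch of the 2D variant, under several assumptions, might look like the following. It is not the authors' implementation: local disagreement is taken to be the fraction of a sample's k nearest neighbours with a different label, the global cue uses the mean embedding of each labelled class as a stand-in for k-means centroids, and a plain 2-means step replaces whatever clustering model the paper uses. All function names are hypothetical.

```python
import numpy as np

def local_disagreement(X, y, k=10):
    """Fraction of each sample's k nearest neighbours with a different label."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    return (y[nn] != y[:, None]).mean(axis=1)

def centroid_distance(X, y):
    """Distance to the mean embedding of the sample's labelled class.

    Assumption: the class mean stands in for a k-means centroid.
    """
    dist = np.empty(len(X))
    for c in np.unique(y):
        mask = y == c
        dist[mask] = np.linalg.norm(X[mask] - X[mask].mean(axis=0), axis=1)
    return dist

def two_means(F, iters=50):
    """Plain 2-means, seeded at the samples with the lowest and highest feature sum."""
    s = F.sum(axis=1)
    centers = np.stack([F[s.argmin()], F[s.argmax()]])
    for _ in range(iters):
        assign = np.linalg.norm(F[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(2):
            if (assign == j).any():
                centers[j] = F[assign == j].mean(axis=0)
    assign = np.linalg.norm(F[:, None] - centers[None], axis=2).argmin(axis=1)
    return assign, centers

def detect_noisy(X, y, k=10):
    """Return a boolean mask that is True for samples in the noise-dominant cluster."""
    F = np.column_stack([local_disagreement(X, y, k), centroid_distance(X, y)])
    F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)   # put both cues on one scale
    assign, centers = two_means(F)
    noisy_cluster = centers[:, 0].argmax()   # higher disagreement = noise-dominant
    return assign == noisy_cluster
```

Here X is an (n_samples, n_features) array of embeddings and y an integer label array. No threshold is set by hand; the split comes from the cluster boundary. The pairwise distance matrix makes this sketch quadratic in memory, so it only suits small sample counts.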

In practice

Improve DNN accuracy under severe label corruption.
Reduce need for manual threshold tuning.
Apply to diverse datasets like CIFAR-10, MNIST, ImageNet-100.
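
To try a detector on datasets like these, symmetric label noise can be simulated and detection recall measured. The sketch below assumes one common convention, in which a chosen fraction of labels is reassigned uniformly to a different class; the paper's exact protocol may differ, and both function names are hypothetical.

```python
import numpy as np

def add_symmetric_noise(labels, rate, num_classes, seed=0):
    """Flip a fraction `rate` of labels uniformly to one of the other classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(rate * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] always lands on a different class.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy, noisy != labels

def detection_recall(flagged, is_noisy):
    """Share of truly corrupted samples that the detector flagged."""
    flagged = np.asarray(flagged, dtype=bool)
    is_noisy = np.asarray(is_noisy, dtype=bool)
    return (flagged & is_noisy).sum() / max(is_noisy.sum(), 1)

# Example: corrupt 40% of the labels in a 10-class problem.
clean_labels = np.arange(1000) % 10
noisy_labels, is_noisy = add_symmetric_noise(clean_labels, rate=0.4, num_classes=10)
```

Recall alone does not show how many clean samples were discarded, so it is worth tracking precision alongside it when comparing detectors.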

Topics

Noisy Label Detection
Data Cleaning
Deep Neural Networks
Multi-metric Clustering
Computer Vision
Feature Engineering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.