An Adaptive Data-Cleaning Framework for Noisy Label Detection

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

An adaptive data-cleaning framework is proposed to address the issue of deep neural networks memorizing noisy labels, which degrades model accuracy and generalization. Existing data-cleaning strategies often rely on manual thresholds, prior knowledge of noise ratios, or single metrics, leading to instability. This new self-adaptive framework integrates local, global, and learning dynamics cues for robust noisy-label detection. It maps samples into a unified low-dimensional feature space using a modular feature concatenation paradigm. Two instantiations are provided: a 2D metric combining class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that adds a z-normalized score. Unlike conventional 1D Gaussian Mixture Models, this framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise demonstrate high recall, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise, and subsequent accuracy gains.
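To make the evaluation setting concrete, here is a minimal sketch of symmetric label noise and detection recall. It assumes that "symmetric" means a fixed fraction of labels is flipped uniformly to a different class; the summary does not state the paper's exact protocol, and some benchmarks allow a flip to land on the original class. The names `inject_symmetric_noise` and `detection_recall` are illustrative, not from the source.

```python
import numpy as np


def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fraction `noise_rate` of labels uniformly to a different class.

    Assumption: flipped labels never equal the original label.
    Returns the noisy labels and a boolean mask of the flipped positions.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1], taken modulo num_classes,
    # always yields a class different from the original one.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    is_noisy = np.zeros(len(labels), dtype=bool)
    is_noisy[flip_idx] = True
    return noisy, is_noisy


def detection_recall(is_noisy, flagged):
    """Fraction of truly noisy samples that the detector flagged."""
    is_noisy = np.asarray(is_noisy, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    n_noisy = is_noisy.sum()
    if n_noisy == 0:
        return 1.0  # nothing to find, so nothing was missed
    return float((is_noisy & flagged).sum() / n_noisy)
```

Under this definition, recall of 98% at 40% noise would mean that at most 2% of the corrupted samples escape detection. Recall alone says nothing about how many clean samples are flagged by mistake, so precision is worth tracking alongside it.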

Key takeaway

If you are a machine learning engineer dealing with noisy labels in computer vision datasets, consider combining several cues (local disagreement, global centroid distance, and a z-normalized score) to detect noisy samples without manual threshold tuning or prior knowledge of the noise ratio. The reported results suggest this improves downstream accuracy even under severe label corruption, with recall of at least 98% on ImageNet-100 at 40% symmetric noise.

Key insights

A self-adaptive framework integrates multi-metric cues for robust noisy-label detection, avoiding manual thresholds and noise priors.

Method

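Map each sample to a low-dimensional feature vector by concatenating per-sample metrics: in 2D, KNN-based local disagreement and k-means-based global centroid distance; in 3D, the same two metrics plus a z-normalized score. Then cluster the samples in this feature space to partition them adaptively into clean-dominant and noise-dominant components.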
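The 2D variant can be sketched roughly as follows, using scikit-learn. This is an illustration under several assumptions, not the authors' implementation: it uses a fixed `k` rather than a class-adaptive one, maps each class to the k-means cluster that holds most of its labels, and uses a two-component Gaussian mixture as a stand-in for the clustering step, which the summary does not specify. The standardization below only puts the two metrics on a common scale; the third, z-normalized metric of the 3D variant is not reproduced because the summary does not say which quantity is normalized. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors


def local_disagreement(features, labels, k=10):
    """Fraction of each sample's k nearest neighbours with a different label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)
    # Column 0 is normally the query point itself, so it is skipped.
    neighbour_labels = labels[idx[:, 1:]]
    return (neighbour_labels != labels[:, None]).mean(axis=1)


def global_centroid_distance(features, labels, num_classes, seed=0):
    """Distance from each sample to the k-means centroid tied to its given label."""
    km = KMeans(n_clusters=num_classes, n_init=10, random_state=seed).fit(features)
    # contingency[c, j] counts samples with given label c in cluster j.
    contingency = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(contingency, (labels, km.labels_), 1)
    # Assumption: a class "owns" the cluster where its label is most frequent.
    cluster_for_class = contingency.argmax(axis=1)
    centroids = km.cluster_centers_[cluster_for_class[labels]]
    return np.linalg.norm(features - centroids, axis=1)


def detect_noisy(features, labels, num_classes, k=10, seed=0):
    """Return a boolean mask marking samples in the noise-dominant component."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Concatenate the two per-sample metrics into a 2D feature space.
    metrics = np.column_stack([
        local_disagreement(features, labels, k),
        global_centroid_distance(features, labels, num_classes, seed),
    ])
    # Standardize each column so neither metric dominates the clustering.
    scaled = (metrics - metrics.mean(axis=0)) / (metrics.std(axis=0) + 1e-12)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(scaled)
    # Both metrics grow for mislabeled samples, so the component with the
    # larger summed mean is treated as noise-dominant.
    noisy_component = gmm.means_.sum(axis=1).argmax()
    return gmm.predict(scaled) == noisy_component
```

Here `features` would typically be embeddings from a pretrained or partially trained network, and `labels` the given (possibly corrupted) integer labels in the range 0 to `num_classes - 1`. No threshold or noise ratio is supplied: the split between clean and noisy samples comes from the mixture fit itself, which is the property the framework emphasizes.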

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.