CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

CMDS-AD (Cross-Modal Dual-Stream Anomaly Detection) targets few-shot anomaly detection in multi-modal settings, where limited training data and the spatially uniform feature processing of existing methods cause cross-modal misalignment and high false-positive rates. To mitigate extreme data scarcity, the framework uses a LoRA-guided diffusion model to generate diverse RGB samples. For 3D normal augmentation, a pre-trained diffusion model acts as a non-linear low-pass filter that extracts low-frequency normal representations from RGB inputs. The result is an auxiliary stream that supplies robust structural templates, which helps the uncompressed real stream isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper handles semantic alignment, and a multiplicative scoring mechanism filters modality-specific noise. In the 1-shot setting, CMDS-AD reports absolute gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, and 7.7% and 5.6% on EyeCandies, setting new performance benchmarks.

Key takeaway

For machine learning engineers building few-shot anomaly detection systems, particularly with multi-modal inputs, CMDS-AD offers a way to overcome data scarcity and reduce false positives. Consider its dual-stream decoupling strategy, which separates structural and defect signals, and its use of diffusion models for data augmentation and normal estimation. In the 1-shot setting, the method reports absolute I-AUROC gains of 5.7% on MVTec 3D-AD and 7.7% on EyeCandies.

Key insights

CMDS-AD uses dual-stream decoupling and diffusion models to enhance few-shot multi-modal anomaly detection by separating structural and defect signals.
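
To make the idea of separating structural and defect signals concrete, here is a minimal sketch. It does not use a diffusion model: a plain moving-average filter stands in for the non-linear low-pass filter, and the 1-D `profile`, the helper `box_lowpass`, and the kernel size are all hypothetical choices made for illustration only.

```python
import numpy as np

def box_lowpass(signal: np.ndarray, k: int = 5) -> np.ndarray:
    """Moving-average low-pass filter with edge padding.

    Assumption: a simple linear stand-in for the diffusion-based filter
    described in the summary, used only to illustrate the decomposition.
    """
    pad = k // 2
    padded = np.pad(signal, pad, mode="edge")
    kernel = np.ones(k) / k
    return np.convolve(padded, kernel, mode="valid")

# Toy 1-D surface profile: a smooth ramp plus a one-sample "micro-defect".
profile = np.linspace(0.0, 1.0, 64)
profile[40] += 0.5

structure = box_lowpass(profile)   # low-frequency structural template
residual = profile - structure     # high-frequency detail left for the real stream

# The defect dominates the residual even though it is small relative to the ramp.
defect_index = int(np.argmax(np.abs(residual)))
```

The smooth template absorbs the overall shape, so the residual is close to zero everywhere except around the defect. That is the intuition behind pairing a low-frequency auxiliary stream with an uncompressed real stream.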

Method

CMDS-AD generates RGB samples via LoRA-guided diffusion and estimates low-frequency normals with a pre-trained diffusion model. It then aligns cross-modal semantics with a hierarchical feature mapper and scores anomalies with a multiplicative mechanism.
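
The summary does not give the exact scoring formula, so the following is only a sketch of why a multiplicative combination can suppress modality-specific noise. It assumes two per-pixel anomaly maps, one per modality, that are min-max normalised and multiplied element-wise. The names `minmax`, `fuse_multiplicative`, `rgb_map`, and `geo_map` are hypothetical.

```python
import numpy as np

def minmax(a: np.ndarray) -> np.ndarray:
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo) if hi > lo else np.zeros_like(a)

def fuse_multiplicative(rgb_map: np.ndarray, geo_map: np.ndarray) -> np.ndarray:
    """Element-wise product: a pixel scores high only if both modalities agree."""
    return minmax(rgb_map) * minmax(geo_map)

rgb_map = np.zeros((4, 4))
geo_map = np.zeros((4, 4))
rgb_map[1, 1], geo_map[1, 1] = 0.9, 0.8   # defect visible in both modalities
rgb_map[3, 0] = 1.0                        # RGB-only response (e.g. glare)
geo_map[0, 3] = 1.0                        # geometry-only response (e.g. sensor noise)

fused = fuse_multiplicative(rgb_map, geo_map)
additive = minmax(rgb_map) + minmax(geo_map)   # baseline for comparison

peak = np.unravel_index(fused.argmax(), fused.shape)
image_score = float(fused.max())
```

In this toy case the product zeroes out both single-modality responses, whereas the additive baseline leaves each of them at 1.0. The trade-off is that a defect visible in only one modality would also be suppressed, so the choice depends on how often true defects appear in both.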

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.