ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

2026-02-16 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Image Processing · Depth: Expert, extended

Summary

ZeroDiff++ is a diffusion-based generative framework for Zero-shot Learning (ZSL) that strengthens visual-semantic correlations, particularly when training data is scarce. It targets the spurious visual-semantic correlations learned by existing generative ZSL methods and introduces two metrics that quantify spuriousness for seen and unseen classes. The framework has five components: diffusion augmentation, which produces diverse noised samples; Supervised Contrastive (SC) representations, which supply instance-level semantics; multi-view discriminators trained with Wasserstein mutual learning, which assess generated features; Diffusion-based Test-time Adaptation (DiffTTA), which adapts the generator through pseudo-label reconstruction; and Diffusion-based Test-time Generation (DiffGen), which produces partially synthesized features. In experiments on the CUB, AWA2, and SUN datasets, ZeroDiff++ outperforms state-of-the-art ZSL methods, reaching a harmonic mean (H) of 65.8 on CUB, 71.2 on AWA2, and 67.3 on SUN, and it remains robust with only 10% of the training data.
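The summary reports H without defining it. In generalized ZSL evaluation, H is conventionally the harmonic mean of the accuracy on seen classes (S) and on unseen classes (U), H = 2SU / (S + U). The sketch below assumes that convention; the function name and the input accuracies are hypothetical and are not taken from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H = 2*S*U / (S + U) of seen and unseen accuracy."""
    if seen_acc + unseen_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Hypothetical accuracies (percent), chosen only to illustrate the metric.
balanced = harmonic_mean(70.0, 70.0)  # equal seen and unseen accuracy
skewed = harmonic_mean(90.0, 50.0)    # same arithmetic mean, biased to seen

print(f"balanced H = {balanced:.2f}, skewed H = {skewed:.2f}")
```

Both hypothetical models average 70% accuracy, but the harmonic mean penalizes the one that is biased toward seen classes. That is why H is the usual headline number for methods that aim to improve unseen-class performance.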

Key takeaway

For computer vision engineers building zero-shot learning models, ZeroDiff++ targets two problems: data scarcity and spurious visual-semantic correlations. Consider integrating diffusion augmentation, dynamic SC-based representations, and multi-view discriminators into your training pipeline. Adding Diffusion-based Test-time Adaptation (DiffTTA) and Diffusion-based Test-time Generation (DiffGen) is reported to improve performance on unseen classes, especially when labeled data is limited.

Key insights

ZeroDiff++ enhances zero-shot learning by mitigating spurious visual-semantic correlations through diffusion-based generation and adaptation.

Principles

Diffusion processes increase distributional overlap, stabilizing GAN training.
Instance-level semantics improve feature generation quality.
Mutual learning across discriminators strengthens their guidance.
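The first principle can be illustrated with the standard forward diffusion process, which mixes a clean feature with Gaussian noise so that real and generated features occupy a wider, more overlapping region. This is a minimal sketch of that generic process, not the paper's implementation: the linear beta schedule, its endpoints, the 1000-step horizon, and the example feature are all assumptions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_feature(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]


rng = random.Random(0)  # seeded so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
feature = [1.0, -2.0, 0.5, 3.0]  # hypothetical visual feature vector

lightly_noised = noise_feature(feature, 10, alpha_bars, rng)   # near x0
heavily_noised = noise_feature(feature, 999, alpha_bars, rng)  # near pure noise
```

At small t the sample stays close to the original feature; at large t it is close to a standard Gaussian regardless of the input. Sampling many timesteps per feature is one way a small training set can yield a large, varied set of noised samples.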

Method

ZeroDiff++ trains a diffusion-based generator with augmented data, dynamic SC-based semantics, and multi-view discriminators. It then adapts the generator at test time using pseudo-label reconstruction and generates features via a traceable diffusion denoising path.

In practice

Use diffusion augmentation to expand limited training datasets.
Employ Supervised Contrastive learning for dynamic instance-level representations.
Implement test-time adaptation for generators to improve unseen class performance.
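For the Supervised Contrastive step, the sketch below implements the generic supervised contrastive loss: each anchor is pulled toward embeddings that share its label and pushed away from all others. It assumes the commonly used formulation with L2-normalized embeddings and a temperature; how ZeroDiff++ derives its dynamic SC-based representations is not described in this summary, so the function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss averaged over valid anchors."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings forming two tight clusters.
embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.0, 0.9]]
aligned = supcon_loss(embeddings, [0, 0, 1, 1])     # labels match clusters
misaligned = supcon_loss(embeddings, [0, 1, 0, 1])  # labels cut across clusters
```

The loss is low when same-label embeddings already cluster together and high when labels cut across clusters, which is the signal that drives an encoder toward instance-level, class-consistent representations.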

Topics

Zero-shot Learning
Diffusion Models
Generative Models
Supervised Contrastive Learning
Visual-Semantic Correlation

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.