Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

Vision-Language Models (VLMs) such as CLIP degrade sharply in Cross-Domain Few-Shot Learning (CDFSL), where target-domain training data is scarce. The paper reports a counterintuitive finding: actively pushing low-similarity "tail tokens" away from their corresponding textual embeddings consistently improves target-domain performance. The authors' interpretation is that, under severe domain shift and limited data, uniformly aligning every token causes the model to overfit on semantically poor tail tokens, so breaking their alignment helps more than enforcing it. Motivated by this, they propose Adaptive Tail-Head Alignment (ATHA), a fine-tuning strategy for CLIP that replaces conventional uniform alignment with an adaptive scheme combining alignment strengthening and alignment weakening. According to the paper, extensive experiments show that ATHA achieves state-of-the-art performance on four challenging CDFSL benchmarks.

Key takeaway

Machine learning engineers adapting Vision-Language Models like CLIP to new domains with limited data should re-evaluate conventional uniform alignment strategies. This research suggests that forcing alignment for all image tokens, especially low-similarity "tail tokens," can lead to overfitting. Consider instead an adaptive alignment technique such as ATHA, which selectively strengthens or weakens the relationship between image tokens and textual embeddings and is reported to improve cross-domain few-shot learning performance.

Key insights

Breaking uniform alignment for low-similarity "tail tokens" improves CLIP adaptation in cross-domain few-shot learning.
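To make "tail tokens" concrete, the sketch below ranks image patch tokens by cosine similarity to a class text embedding and labels the lowest-scoring ones as the tail. It is an illustration only: the function name `split_head_tail`, the `tail_fraction` parameter, and the random stand-in features are assumptions, and the paper may select tail tokens differently.

```python
import numpy as np

def split_head_tail(patch_tokens, text_embedding, tail_fraction=0.25):
    """Rank patch tokens by cosine similarity to a text embedding.

    Returns (similarities, head_idx, tail_idx).
    tail_fraction is a hypothetical knob, not a value from the paper.
    """
    # L2-normalize so the dot product equals cosine similarity.
    patches = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    sims = patches @ text
    # Ascending order: the lowest-similarity tokens come first.
    order = np.argsort(sims)
    n_tail = max(1, int(round(tail_fraction * len(sims))))
    return sims, order[n_tail:], order[:n_tail]

# Random stand-ins for CLIP patch features and a class text feature.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))   # 8 patches, 16-dim features
text = rng.normal(size=16)
sims, head_idx, tail_idx = split_head_tail(tokens, text)
```

In a real pipeline the inputs would come from CLIP's image and text encoders; here they are random arrays so the snippet runs on its own.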

Method

Adaptive Tail-Head Alignment (ATHA) transforms uniform CLIP fine-tuning into an adaptive paradigm, selectively strengthening or weakening alignment between image patch tokens and textual embeddings.
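The contrast between uniform and adaptive alignment can be sketched with a toy objective. This is not the paper's loss: `adaptive_alignment_loss`, `tail_fraction`, and `tail_weight` are hypothetical names chosen for illustration. It assumes the idea reduces to rewarding similarity for high-similarity (head) tokens and penalizing it for low-similarity (tail) tokens.

```python
import numpy as np

def uniform_alignment_loss(sims):
    """Conventional objective: pull every token toward the text embedding."""
    return -sims.mean()

def adaptive_alignment_loss(sims, tail_fraction=0.25, tail_weight=1.0):
    """Toy adaptive objective (an assumption, not the ATHA formulation).

    Minimizing -mean(head) strengthens alignment for head tokens;
    minimizing +mean(tail) weakens alignment for tail tokens.
    """
    order = np.argsort(sims)  # ascending similarity
    n_tail = int(round(tail_fraction * len(sims)))
    # Keep at least one token in each group.
    n_tail = min(max(1, n_tail), len(sims) - 1)
    tail, head = order[:n_tail], order[n_tail:]
    return -sims[head].mean() + tail_weight * sims[tail].mean()

# Token-to-text similarities for one image: three head tokens, one tail token.
sims = np.array([0.9, 0.8, 0.7, 0.1])
pushed = np.array([0.9, 0.8, 0.7, -0.1])  # same image, tail token pushed away

uniform_before = uniform_alignment_loss(sims)
uniform_after = uniform_alignment_loss(pushed)
adaptive_before = adaptive_alignment_loss(sims)
adaptive_after = adaptive_alignment_loss(pushed)
```

Pushing the tail token away raises the uniform loss but lowers the adaptive one, which is the behavioral difference the method description points to. In actual fine-tuning the similarities would be differentiable outputs of CLIP rather than fixed arrays.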

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.