Semantic-aware Adversarial Fine-tuning for CLIP
Summary
The paper introduces Semantic-aware Adversarial Fine-Tuning (SAFT), a framework for improving the adversarial robustness of CLIP models in zero-shot classification. Existing adversarial fine-tuning methods rely on minimizing the cosine similarity between an image and a single hand-crafted text template (e.g., "A photo of a {label}"), and the paper shows that such methods are less effective when evaluated with semantically richer similarity metrics. SAFT instead generates "semantic-aware adversarial examples" (AEs) through a "semantic-ensemble attack," which minimizes the average similarity between an image and an ensemble of refined textual descriptions. These descriptions are first generated by a foundation model (LLM or MLLM) and then filtered to remove hallucinations. Experiments across 16 datasets show that SAFT significantly improves zero-shot adversarial robustness, outperforming prior methods by at least 3.85% on average against PGD-100 attacks while maintaining high clean accuracy. The code is available at https://github.com/tmlr-group/SAFT.
Key takeaway
For research scientists and computer vision engineers building robust vision-language models, SAFT offers a way to significantly improve CLIP's zero-shot adversarial robustness. Consider moving beyond single-template approaches by integrating hallucination-aware, semantically enriched textual descriptions into your adversarial fine-tuning pipelines. Doing so improves robustness against a range of attacks and generalization to unseen text templates, both of which matter for reliable real-world deployment.
Key insights
Semantically enriched text descriptions improve CLIP's adversarial robustness because they make it possible to generate more effective adversarial examples for fine-tuning.
Principles
- Cosine similarity with single templates is insufficient for robust image-text alignment.
- Adversarial examples must generalize across linguistic variations for effective fine-tuning.
- Semantic filtering of LLM-generated descriptions is crucial to mitigate hallucinations.
Method
SAFT generates semantic-aware AEs by minimizing average similarity between images and an ensemble of hallucination-filtered, foundation model-generated textual descriptions, then fine-tunes CLIP's image encoder with these AEs.
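The attack step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a toy linear map `W` stands in for CLIP's image encoder so that the gradient can be written analytically, and every name here (`encode`, `semantic_ensemble_attack`, `eps`, `alpha`, `steps`) is a hypothetical choice made for illustration. The only part taken from the summary is the objective, namely minimizing the average similarity between the image and an ensemble of text embeddings under a bounded perturbation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def encode(W, x):
    """Toy linear 'image encoder' z = W x (assumed stand-in for CLIP's encoder)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def avg_similarity(W, x, text_embs):
    """Average cosine similarity between the image embedding and all text embeddings."""
    z = encode(W, x)
    return sum(cosine(z, t) for t in text_embs) / len(text_embs)

def grad_avg_similarity(W, x, text_embs):
    """Analytic gradient of the average cosine similarity with respect to x."""
    z = encode(W, x)
    nz = math.sqrt(sum(a * a for a in z))
    g_z = [0.0] * len(z)
    for t in text_embs:
        nt = math.sqrt(sum(b * b for b in t))
        dot = sum(a * b for a, b in zip(z, t))
        for i in range(len(z)):
            # d cos(z, t) / d z_i, averaged over the ensemble
            g_z[i] += (t[i] / (nz * nt) - dot * z[i] / (nz ** 3 * nt)) / len(text_embs)
    # Chain rule through z = W x: grad_x = W^T grad_z
    return [sum(W[i][j] * g_z[i] for i in range(len(z))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def semantic_ensemble_attack(W, x, text_embs, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD that lowers the ensemble-averaged similarity."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_avg_similarity(W, x_adv, text_embs)
        for j in range(len(x)):
            stepped = x_adv[j] - alpha * sign(g[j])              # signed descent step
            stepped = min(max(stepped, x[j] - eps), x[j] + eps)  # project into eps-ball
            x_adv[j] = min(max(stepped, 0.0), 1.0)               # keep a valid pixel range
    return x_adv

# Toy data: a 3-"pixel" image and two text embeddings for the true class.
W = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5]]
x = [0.8, 0.2, 0.1]
text_embs = [[1.0, 0.0], [0.9, 0.1]]
x_adv = semantic_ensemble_attack(W, x, text_embs)
print(avg_similarity(W, x, text_embs), avg_similarity(W, x_adv, text_embs))
```

In a real pipeline the gradient would come from automatic differentiation through the CLIP image encoder, and the resulting AEs would then be used to fine-tune that encoder, as the Method paragraph states.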
In practice
- Use foundation models to generate diverse class descriptions.
- Filter generated descriptions for semantic relevance (e.g., top-K cosine similarity).
- Employ ensemble-based adversarial attacks for robust fine-tuning.
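The filtering step in the list above might look like the following sketch. The summary only mentions top-K cosine similarity as an example; what the similarity is measured against is not stated, so this sketch assumes a reference embedding of the plain class template. The function name `filter_descriptions` and the two-dimensional embeddings are hypothetical and chosen purely for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def filter_descriptions(reference_emb, described, k):
    """Keep the k descriptions whose embeddings are most similar to the reference.

    `described` is a list of (text, embedding) pairs; the reference embedding is
    an assumed anchor, e.g. the embedding of the plain class template.
    """
    ranked = sorted(described, key=lambda item: cosine(reference_emb, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical embeddings: the first candidate plays the role of a hallucinated,
# off-topic description that the filter should drop.
reference = [1.0, 0.0]
candidates = [
    ("a vehicle with four wheels", [0.0, 1.0]),
    ("a small domesticated feline", [0.9, 0.1]),
    ("an animal with whiskers and pointed ears", [0.8, 0.3]),
]
kept = filter_descriptions(reference, candidates, k=2)
print(kept)
```

The descriptions that survive the filter would then supply the text embeddings for the ensemble-based attack.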
Topics
- CLIP Adversarial Robustness
- Adversarial Fine-tuning
- Semantic-aware AEs
- Zero-shot Classification
- Foundation Models
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.