Semantic-aware Adversarial Fine-tuning for CLIP

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The paper introduces Semantic-aware Adversarial Fine-Tuning (SAFT), a framework for improving the adversarial robustness of CLIP models in zero-shot classification. Existing adversarial fine-tuning methods generate adversarial examples (AEs) by minimizing the cosine similarity between an image and a single hand-crafted text template (e.g., "A photo of a {label}"), and are shown to be less effective when evaluated with semantically richer similarity metrics. SAFT addresses this by generating "semantic-aware" AEs through a "semantic-ensemble attack." This attack minimizes the average similarity between an image and an ensemble of refined textual descriptions, which are first generated by a foundation model (an LLM or MLLM) and then filtered to remove hallucinations. Experiments across 16 datasets show that SAFT improves zero-shot adversarial robustness, outperforming prior methods by at least 3.85% on average against PGD-100 attacks while maintaining high clean accuracy. The code is available at https://github.com/tmlr-group/SAFT.
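One way to write the two attack objectives contrasted above is sketched here. The notation is an assumption made for illustration and is not taken from the paper: f is CLIP's image encoder, g its text encoder, t_y the single template for the true label y, d_{y,k} the K hallucination-filtered descriptions of that label, and δ a perturbation bounded in some p-norm by a budget ε.

```latex
% Single-template attack (prior adversarial fine-tuning methods)
\[
\delta_{\text{single}} = \arg\min_{\|\delta\|_p \le \epsilon}
  \cos\big(f(x+\delta),\, g(t_y)\big)
\]

% Semantic-ensemble attack (average over K filtered descriptions)
\[
\delta_{\text{sem}} = \arg\min_{\|\delta\|_p \le \epsilon}
  \frac{1}{K}\sum_{k=1}^{K} \cos\big(f(x+\delta),\, g(d_{y,k})\big)
\]
```

In both cases the adversarial example is x + δ. The only difference in this sketch is what the image embedding is pushed away from: one template embedding, or the average similarity over several descriptions.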

Key takeaway

For research scientists and computer vision engineers developing robust vision-language models, SAFT offers a way to improve CLIP's zero-shot adversarial robustness. Consider replacing the single text template in your adversarial fine-tuning pipeline with hallucination-filtered, semantically enriched textual descriptions. According to the paper, this improves robustness against various attacks and generalization to unseen text templates, both of which matter for reliable real-world deployment.

Key insights

Attacking against semantically enriched text descriptions, rather than a single template, yields more effective adversarial examples, and fine-tuning on them improves CLIP's adversarial robustness.

Method

SAFT generates semantic-aware AEs by minimizing average similarity between images and an ensemble of hallucination-filtered, foundation model-generated textual descriptions, then fine-tunes CLIP's image encoder with these AEs.
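The attack step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the "image encoder" is a random linear map `W` standing in for CLIP's image tower, the description embeddings are random unit vectors standing in for encoded, already-filtered descriptions, and the L_inf PGD settings (`eps`, `alpha`, `steps`) are illustrative defaults. All function and variable names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def encode_image(W, x):
    """Toy linear 'image encoder' z = W x (assumed stand-in for CLIP)."""
    return [dot(row, x) for row in W]


def avg_similarity(W, x, text_embs):
    """Mean cosine similarity between the image and each description."""
    u = normalize(encode_image(W, x))
    return sum(dot(u, t) for t in text_embs) / len(text_embs)


def grad_avg_similarity(W, x, text_embs):
    """Analytic gradient of avg_similarity with respect to the pixels x."""
    z = encode_image(W, x)
    z_norm = math.sqrt(dot(z, z))
    u = [v / z_norm for v in z]
    # Mean of the unit-norm description embeddings.
    t_bar = [sum(t[i] for t in text_embs) / len(text_embs) for i in range(len(z))]
    s = dot(u, t_bar)
    # d(u . t_bar)/dz = (t_bar - (u . t_bar) u) / ||z||
    g_z = [(t_bar[i] - s * u[i]) / z_norm for i in range(len(z))]
    # Chain rule through z = W x gives W^T g_z.
    return [sum(W[i][j] * g_z[i] for i in range(len(z))) for j in range(len(x))]


def semantic_ensemble_pgd(W, x, text_embs, eps=8 / 255, alpha=2 / 255, steps=10):
    """L_inf PGD that minimizes the average image-description similarity."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_avg_similarity(W, x_adv, text_embs)
        for j in range(len(x)):
            step = alpha * ((g[j] > 0) - (g[j] < 0))   # alpha * sign(g[j])
            v = x_adv[j] - step                          # descend on similarity
            v = min(max(v, x[j] - eps), x[j] + eps)      # project onto eps-ball
            x_adv[j] = min(max(v, 0.0), 1.0)             # keep valid pixel range
    return x_adv


rng = random.Random(0)
n_pixels, dim, k = 32, 8, 5
W = [[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(dim)]
x_clean = [rng.random() for _ in range(n_pixels)]
# Stand-ins for the embeddings of k filtered descriptions of the true class.
descriptions = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(k)]

x_adv = semantic_ensemble_pgd(W, x_clean, descriptions)
clean_sim = avg_similarity(W, x_clean, descriptions)
adv_sim = avg_similarity(W, x_adv, descriptions)
print(f"average similarity: clean={clean_sim:.4f}, adversarial={adv_sim:.4f}")
```

In a real pipeline the gradient would come from automatic differentiation through CLIP rather than a closed form, and the resulting `x_adv` batch would then be used to fine-tune the image encoder. Passing a single embedding in `text_embs` reduces the sketch to the single-template attack, which makes the two easy to compare.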

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.