When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Summary
Multi-view guided Adaptive Counterattack (MAC) is a method for improving the adversarial robustness of vision-language models such as CLIP at test time. CLIP excels at zero-shot recognition but remains vulnerable to adversarial perturbations, and existing Test-Time Counterattack (TTC) approaches fall short under strong attacks. TTC struggles for two reasons: it relies on a single, directly corrupted view, and it uses a rigid, noise-driven hard-gating scheme that cannot adapt to varying corruption levels. MAC addresses both. It first generates augmented views of the input image to diversify the embeddings. It then applies counterattacks to refine the corrupted view embeddings, scaling the counterattack intensity for each view according to that view's estimated degree of corruption. Finally, it aggregates the adaptively counterattacked views into a single robust prediction. The authors report substantial robustness improvements across 20 datasets and various attack scenarios, while the tuning-free design preserves inference speed and memory efficiency. The work was accepted to CVPR 2026.
Key takeaway
For Machine Learning Engineers or AI Scientists deploying vision-language models like CLIP in security-sensitive applications, Multi-view guided Adaptive Counterattack (MAC) offers a test-time defense aimed at strong adversarial perturbations. Because MAC is tuning-free, it can be evaluated without fine-tuning the model, and the authors report that it keeps inference fast and memory-efficient. Consider it when an existing TTC-style defense breaks down under stronger attacks.
Key insights
MAC improves CLIP's adversarial robustness by adaptively counterattacking multiple augmented views based on corruption severity.
Principles
- Multi-view augmentation diversifies embeddings for robustness.
- Adaptive counterattack intensity improves resilience to varying corruption.
- Aggregating refined views yields robust final predictions.
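The second principle can be illustrated by contrasting an all-or-nothing gate with a continuous scaling rule. The sketch below is only an illustration: the function names, the threshold, the `4/255` budget, and the linear scaling form are assumptions chosen for clarity, not formulas taken from the paper.

```python
def hard_gate_budget(score, threshold=0.5, eps_max=4 / 255):
    """All-or-nothing rule: full counterattack budget above the threshold, none below.

    `score` is a hypothetical corruption estimate in [0, 1].
    """
    return eps_max if score >= threshold else 0.0


def adaptive_budget(score, eps_max=4 / 255):
    """Continuous rule: budget grows with the (clamped) corruption estimate.

    The linear form is an assumption made for illustration.
    """
    clamped = min(max(score, 0.0), 1.0)
    return eps_max * clamped


if __name__ == "__main__":
    for s in (0.1, 0.49, 0.51, 0.9):
        print(s, hard_gate_budget(s), adaptive_budget(s))
```

Under the hard gate, scores of 0.49 and 0.51 receive completely different treatment, while the continuous rule gives them nearly identical budgets. That is the kind of rigidity the summary attributes to hard gating.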
Method
MAC constructs augmented views of the input, refines their corrupted embeddings via counterattacks, scales the counterattack intensity for each view according to its estimated corruption, and then aggregates the refined views into a robust prediction.
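The four steps above can be sketched end to end on a toy problem. Everything in this sketch is an assumption made for illustration: the encoder is a small `tanh(Wx)` stand-in rather than CLIP, the augmentation is additive jitter, the corruption score is a placeholder based on embedding shift under noise, and the counterattack objective (pushing the embedding away from its starting point) is a generic choice. None of the names or formulas come from the paper.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def encode(W, x):
    """Toy stand-in for an image encoder (NOT CLIP): tanh(W x)."""
    return np.tanh(x @ W.T)


def make_views(x, n_views, rng, jitter=0.05):
    """Hypothetical augmentation: the input plus jittered copies."""
    noise = rng.normal(0.0, jitter, size=(n_views - 1, x.size))
    return np.vstack([x[None, :], x[None, :] + noise])


def noise_sensitivity_score(W, view, rng, sigma=0.01, tau=0.05):
    """Placeholder corruption score in (0, 1].

    Assumption: a small embedding shift under random noise maps to a high score.
    """
    noisy = view + rng.normal(0.0, sigma, view.shape)
    shift = np.linalg.norm(encode(W, noisy) - encode(W, view))
    return float(np.exp(-shift / tau))


def counterattack(W, view, budget, rng, steps=3):
    """Sign-gradient ascent on ||f(view + delta) - f(view)||^2 with |delta|_inf <= budget."""
    if budget <= 0:
        return view
    z0 = encode(W, view)
    delta = rng.uniform(-budget, budget, size=view.shape)  # random start: gradient is 0 at delta = 0
    step = budget / steps
    for _ in range(steps):
        z = encode(W, view + delta)
        grad = (2.0 * (z - z0) * (1.0 - z ** 2)) @ W  # analytic gradient for tanh(W x)
        delta = np.clip(delta + step * np.sign(grad), -budget, budget)
    return view + delta


def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def mac_sketch(W, text_emb, x, score_fn, eps_max=0.1, n_views=4, logit_scale=100.0, seed=0):
    rng = np.random.default_rng(seed)
    views = make_views(x, n_views, rng)                         # 1. build views
    scores = [score_fn(W, v, rng) for v in views]               # 2. estimate corruption per view
    refined = np.stack([counterattack(W, v, eps_max * s, rng)   # 3. per-view scaled counterattack
                        for v, s in zip(views, scores)])
    img_emb = l2_normalize(encode(W, refined))
    probs = softmax(logit_scale * img_emb @ text_emb.T)
    return probs.mean(axis=0)                                   # 4. aggregate over views


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(8, 16))
    text_emb = l2_normalize(rng.normal(size=(3, 8)))
    print(mac_sketch(W, text_emb, rng.normal(size=16), noise_sensitivity_score))
```

Passing the score function as an argument keeps the corruption estimator swappable, since the summary does not specify how MAC computes it. Averaging class probabilities is likewise only one possible aggregation rule.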
In practice
- Apply multi-view augmentation to enhance model robustness.
- Implement adaptive counterattack scaling for varied attack strengths.
- Integrate MAC's tuning-free design for efficient inference.
Topics
- CLIP
- Adversarial Robustness
- Multi-view Learning
- Test-Time Counterattack
- Vision-Language Models
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.