When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

2026-06-05 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Multi-view guided Adaptive Counterattack (MAC) is a method for improving the adversarial robustness of vision-language models such as CLIP. It addresses the fragility of the existing test-time counterattack (TTC) approach under strong attacks by introducing a corruption-aware soft weighting scheme for multi-view counterattacks. MAC first constructs augmented views of the input image to obtain diverse embeddings, then refines these corrupted embeddings, adaptively scaling the counterattack intensity for each view according to that view's estimated degree of corruption. Finally, the counterattacked views are aggregated into a single robust prediction. Experiments across 20 datasets and a range of attack scenarios show that MAC significantly improves robustness while keeping inference fast and memory-efficient, thanks to its tuning-free design.

Key takeaway

For machine learning engineers deploying vision-language models such as CLIP in security-sensitive applications, MAC offers a test-time defense against adversarial perturbations. Consider integrating multi-view guided adaptive counterattacks where strong attack scenarios are a concern. Because the approach is tuning-free, it preserves high inference speed and memory efficiency, which makes it practical for production environments.

Key insights

MAC improves CLIP's adversarial robustness by adaptively counterattacking multi-view image embeddings with corruption-aware weighting.

Principles

Adversarial robustness benefits from multi-view processing.
Adaptive scaling of counterattack intensity is crucial.
Corruption awareness enhances defense efficacy.

Method

MAC constructs augmented views of the input image, refines their corrupted embeddings, adaptively scales the counterattack intensity for each view based on its estimated corruption, and then aggregates the views into a robust prediction.
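
The steps above can be sketched in a few lines of NumPy. This is a toy illustration of the control flow only, not the authors' implementation (see sunoh-kim/MAC for that). Everything specific here is an assumption made for the example: the linear stand-in for the image encoder, the Gaussian-jitter augmentation, the corruption proxy (cosine distance of each view from the mean view embedding), the exponential mapping from corruption score to counterattack strength, the counterattack objective (pushing the embedding away from its starting point inside an L-infinity budget), and the plain mean used for aggregation. All function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB, N_CLASSES = 32, 16, 5

# Toy stand-ins (assumptions): a linear "image encoder" and random "text" embeddings.
W = rng.normal(size=(D_EMB, D_IN)) / np.sqrt(D_IN)
TEXT = rng.normal(size=(N_CLASSES, D_EMB))


def normalize(v):
    """L2-normalize along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)


def encode(x):
    """Hypothetical encoder: a single linear map instead of CLIP."""
    return x @ W.T


def make_views(x, n_views, noise_std=0.05):
    """Assumed augmentation: the original input plus Gaussian-jittered copies."""
    jitter = rng.normal(scale=noise_std, size=(n_views - 1, x.shape[0]))
    return np.vstack([x[None, :], x[None, :] + jitter])


def corruption_scores(views):
    """Assumed corruption proxy: cosine distance of each view from the mean view."""
    z = normalize(encode(views))
    center = normalize(z.mean(axis=0))
    return 1.0 - z @ center


def counterattack(view, eps, steps=4):
    """Sign-gradient ascent on ||f(view + delta) - f(view)||^2 with |delta| <= eps."""
    z0 = encode(view)
    # Random start, because the gradient is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=view.shape)
    for _ in range(steps):
        # Analytic gradient of the squared distance for the linear toy encoder.
        grad = 2.0 * (encode(view + delta) - z0) @ W
        delta = np.clip(delta + (eps / steps) * np.sign(grad), -eps, eps)
    return view + delta


def mac_like_predict(x, n_views=4, eps_max=0.03, temperature=0.1):
    views = make_views(x, n_views)
    scores = corruption_scores(views)
    # Soft per-view strength, at most 1; the highest-scoring view gets the full budget.
    strength = np.exp((scores - scores.max()) / temperature)
    refined = np.stack(
        [counterattack(v, eps_max * s) for v, s in zip(views, strength)]
    )
    # Assumed aggregation: mean of normalized view embeddings, then renormalize.
    fused = normalize(normalize(encode(refined)).mean(axis=0))
    logits = fused @ normalize(TEXT).T
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), strength


x = rng.normal(size=D_IN)
probs, strength = mac_like_predict(x)
print("predicted class:", int(probs.argmax()))
print("per-view counterattack strength:", strength.round(3))
```

With a real CLIP encoder the analytic gradient in `counterattack` would be replaced by automatic differentiation; the per-view strength scaling and the final aggregation step would keep the same shape.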

In practice

Implement multi-view augmentation for robustness.
Dynamically adjust defense strength based on corruption.
Aggregate diverse view embeddings for final prediction.

Topics

CLIP
Adversarial Robustness
Multi-View Learning
Test-Time Counterattack
Vision-Language Models
Image Embeddings

Code references

sunoh-kim/MAC

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.