SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT) targets the fragility of vision-language models (VLMs) such as CLIP under adversarial perturbations. Existing test-time adaptation defenses improve robustness but incur significant slowdowns. SS-TPT aims to improve both robustness and throughput by scoring the quality of each augmented view with two complementary measures: stability, which measures how invariant a view's prediction is to weak augmentations, and suitability, which assesses the view's feature-space density among the other views. These SS scores guide both adaptation, through an SS-guided consistency loss, and inference, through an SS-weighted prediction, so that trustworthy views are amplified and corrupted ones are suppressed. According to the authors, experiments show that SS-TPT significantly outperforms prior state-of-the-art methods, with better robustness-throughput trade-offs across diverse datasets and varying numbers of views, which they present as evidence of practicality and generality.
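
The two scores can be illustrated with a minimal sketch. The summary does not give the paper's formulas, so everything here is an assumption made for illustration: stability is taken as one minus the mean total variation distance between a view's class probabilities and those of its weakly augmented copies, and suitability is approximated by the mean cosine similarity of a view's feature to the other views' features (a simple density proxy). The function names `stability_score` and `suitability_scores` are hypothetical.

```python
import math


def total_variation(p, q):
    """Total variation distance between two probability vectors, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def stability_score(view_probs, weak_aug_probs):
    """Assumed stability: 1 minus the mean TV distance between a view's
    prediction and the predictions of its weakly augmented copies."""
    if not weak_aug_probs:
        raise ValueError("need at least one weakly augmented prediction")
    mean_tv = sum(total_variation(view_probs, q) for q in weak_aug_probs)
    mean_tv /= len(weak_aug_probs)
    return 1.0 - mean_tv


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suitability_scores(features):
    """Assumed suitability: for each view, the mean cosine similarity to
    every other view's feature. Isolated views receive low scores."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two views")
    scores = []
    for i in range(n):
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return scores


# Toy usage: the third view sits far from the other two in feature space.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_suitability = suitability_scores(toy_features)
toy_stability = stability_score([0.7, 0.3], [[0.6, 0.4], [0.8, 0.2]])
```

In this toy input the third view ends up with the lowest suitability because it is far from the other two, which is the behavior one would want for a view corrupted by an adversarial perturbation.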

Key takeaway

For machine learning engineers deploying vision-language models in security-sensitive applications, SS-TPT offers a practical way to improve adversarial robustness at a lower throughput cost than earlier test-time adaptation defenses. Consider evaluating SS-TPT where resilience to perturbations matters: its stability and suitability scores weight each augmented view, so trustworthy views count for more and corrupted ones count for less. The reported robustness-throughput trade-offs suggest it can make VLM deployments more reliable and efficient in real-world scenarios.

Key insights

SS-TPT enhances VLM adversarial robustness by dynamically weighting augmented views based on prediction stability and feature suitability.

Method

SS-TPT uses the stability and suitability scores in two places: an SS-guided consistency loss steers adaptation, and an SS-weighted prediction drives inference. Both amplify trustworthy views and suppress corrupted ones.
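
As a rough sketch of how per-view scores could drive both steps, the snippet below turns the scores into weights, forms a weighted prediction, and evaluates a weighted consistency term. The summary does not state the paper's exact formulation, so the choices here are assumptions: the two scores are combined by a product and normalized with a temperature softmax, and the consistency term is a weighted KL divergence from each view to the weighted consensus. All function names and the `temperature` parameter are hypothetical.

```python
import math


def ss_weights(stability, suitability, temperature=1.0):
    """Assumed weighting: softmax over the product of the two scores."""
    combined = [s * u / temperature for s, u in zip(stability, suitability)]
    peak = max(combined)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]


def ss_weighted_prediction(view_probs, weights):
    """Weighted average of the per-view class probabilities."""
    num_classes = len(view_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, view_probs))
        for c in range(num_classes)
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for probability vectors, with a small epsilon for safety."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def ss_consistency_loss(view_probs, weights):
    """Assumed loss: weighted KL divergence from each view to the consensus."""
    consensus = ss_weighted_prediction(view_probs, weights)
    return sum(w * kl_divergence(p, consensus) for w, p in zip(weights, view_probs))


# Toy usage: two views agree on class 0, a low-scoring third view disagrees.
toy_probs = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
toy_weights = ss_weights([0.9, 0.8, 0.2], [0.9, 0.9, 0.1], temperature=0.1)
toy_prediction = ss_weighted_prediction(toy_probs, toy_weights)
toy_loss = ss_consistency_loss(toy_probs, toy_weights)
```

This only computes values. In an actual prompt-tuning loop the loss would be evaluated inside an automatic differentiation framework so that its gradient can update the prompt parameters.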

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.