SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

2026-06-05 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT) is a method for improving the adversarial robustness of Vision-Language Models (VLMs) such as CLIP, which are fragile under adversarial perturbations. Previous test-time adaptation defenses rely on many augmented views, which slows inference and forces a trade-off between robustness and throughput. SS-TPT instead scores the quality of each augmented view with two complementary metrics: stability, which quantifies how invariant the prediction is to weak augmentations, and suitability, which assesses feature-space density among views. These "SS scores" are used in both adaptation, through an SS-guided consistency loss, and inference, through an SS-weighted prediction, so that reliable views are prioritized and corrupted ones are down-weighted. The reported experiments show SS-TPT outperforming prior methods and delivering better robustness-throughput trade-offs across diverse datasets and varying numbers of views, which the authors present as evidence of its practicality and generality.

Key takeaway

For machine learning engineers deploying Vision-Language Models like CLIP in adversarial environments, SS-TPT addresses the robustness-throughput dilemma. If inference is slow because your defense relies on many augmented views, consider integrating SS-TPT. It weights augmented views by their stability and suitability, aiming for stronger adversarial robustness at lower computational overhead than prior approaches.

Key insights

SS-TPT enhances VLM adversarial robustness by dynamically weighting augmented views using stability and suitability scores, improving the robustness-throughput trade-off.

Principles

Prediction invariance signals view trustworthiness.
Feature-space density indicates view quality.
Intelligent view weighting improves robustness.
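
The first two principles can be made concrete with a small scoring sketch. This is an illustration only, not the authors' implementation: the summary does not give the exact formulas, so the sketch assumes stability is the fraction of weakly augmented copies that keep the view's top-1 class, and suitability is the mean cosine similarity to the k nearest other views. The function names (`stability_score`, `suitability_scores`) and the parameter `k` are hypothetical.

```python
import math

def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def stability_score(view_probs, weak_aug_probs):
    """ASSUMED definition: share of weakly augmented copies whose
    top-1 class matches the top-1 class of the view itself."""
    top = argmax(view_probs)
    agree = sum(1 for p in weak_aug_probs if argmax(p) == top)
    return agree / len(weak_aug_probs)

def suitability_scores(features, k=2):
    """ASSUMED definition: for each view, the mean cosine similarity
    to its k nearest other views (a simple feature-space density proxy)."""
    scores = []
    for i, f in enumerate(features):
        sims = sorted(
            (cosine(f, g) for j, g in enumerate(features) if j != i),
            reverse=True,
        )
        if not sims:  # a single view has no neighbours
            scores.append(0.0)
            continue
        nearest = sims[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores
```

Under these assumptions, a view whose features sit far from the other views receives a low suitability score, and a view whose prediction flips under weak augmentation receives a low stability score.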

Method

SS-TPT evaluates augmented views using stability (prediction invariance to weak augmentations) and suitability (feature-space density). These SS scores guide adaptation via a consistency loss and inference via a weighted prediction, amplifying trustworthy views.
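
As a rough sketch of how SS scores could drive both steps, the code below turns per-view scores into weights, forms a weighted prediction, and computes a weighted consistency loss. Several choices here are assumptions rather than details from the source: combining the two scores by multiplication, normalizing the weights to sum to one, and using a weighted KL divergence to the weighted mean as the consistency term. All function names are hypothetical.

```python
import math

def ss_weights(stability, suitability):
    """ASSUMED combination: product of the two scores (clamped at zero),
    normalized to sum to one. Falls back to uniform weights if all are zero."""
    raw = [max(0.0, s) * max(0.0, d) for s, d in zip(stability, suitability)]
    total = sum(raw)
    if total <= 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

def ss_weighted_prediction(view_probs, weights):
    """Weighted average of per-view class probabilities."""
    n_classes = len(view_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, view_probs))
        for c in range(n_classes)
    ]

def ss_consistency_loss(view_probs, weights, eps=1e-12):
    """ASSUMED loss form: weighted sum of KL(view || weighted mean).
    Views with low SS weight contribute little to the loss."""
    target = ss_weighted_prediction(view_probs, weights)
    loss = 0.0
    for w, p in zip(weights, view_probs):
        kl = sum(
            pc * math.log((pc + eps) / (tc + eps))
            for pc, tc in zip(p, target)
        )
        loss += w * kl
    return loss

# Three views over two classes; the third is unstable and gets weight zero.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
weights = ss_weights([1.0, 1.0, 0.0], [0.9, 0.9, 0.2])
prediction = ss_weighted_prediction(probs, weights)
```

In a real test-time prompt tuning loop, the loss would be minimized with respect to the prompt parameters by an autodiff framework; the plain-Python version only shows how the weights enter the two quantities.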

In practice

Apply SS-TPT for VLM adversarial robustness.
Use SS scores to weight augmented views, down-weighting unreliable ones.
Optimize robustness-throughput trade-offs.

Topics

Vision-Language Models
Adversarial Robustness
Test-Time Adaptation
Prompt Tuning
CLIP
Model Inference Efficiency

Code references

sunoh-kim/SS-TPT

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.