Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification using Sparsity-Guided Steering Vectors

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Advanced, extended

Summary

Researchers from Rutgers University introduce Visual Sparse Steering (VS2), a test-time method that improves zero-shot image classification in vision foundation models without retraining or large labeled datasets. VS2 steers the model with vectors derived from sparse features learned by top-$k$ Sparse Autoencoders (SAEs). It improves zero-shot CLIP accuracy by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. The authors also propose VS2++, a retrieval-augmented variant that uses pseudo-labeled neighbors to selectively amplify relevant sparse features. With oracle sets, VS2++ achieves absolute top-1 gains over zero-shot CLIP of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Finally, Prototype-Aligned Sparse Steering (PASS) adds a prototype-alignment loss during SAE training and outperforms VS2 by 6.12% on CIFAR-100 with ViT-B/32.

Key takeaway

Research scientists studying the interpretability and efficiency of vision foundation models should consider Visual Sparse Steering (VS2) for improving zero-shot classification. The method delivers accuracy gains without costly retraining, particularly when disambiguating visually or taxonomically similar categories. VS2 and VS2++ act purely at inference time, while PASS additionally modifies SAE training with a prototype-alignment loss; all three use sparse autoencoders to make vision models more controllable and reliable.

Key insights

Sparse Autoencoders can generate effective steering vectors for zero-shot image classification in vision models.

Principles

Sparse features capture salient aspects of visual embeddings.
Steering vectors can disambiguate visually similar categories.
Retrieval-augmented steering enhances performance with unlabeled data.

Method

VS2 constructs steering vectors from sparse features learned by top-$k$ Sparse Autoencoders. VS2++ uses pseudo-labeled neighbors from retrieved similar images to create contrastive steering vectors. PASS adds a prototype-alignment loss during SAE training.
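The VS2 and VS2++ steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SAE weights are random stand-ins for a trained top-$k$ SAE, the SAE is applied directly to the CLIP image embedding (the layer used in the paper may differ), the VS2 steering vector is assumed to be the bias-free decoder read-out of the image's own top-$k$ code, unit-normalized and scaled by a strength `lam`, and the VS2++ contrastive vector is assumed to be the positive part of the difference between the mean sparse codes of neighbors that share the query's provisional zero-shot label and those that do not. The names `topk_encode`, `zero_shot_predict`, `steer`, `vs2pp_steer`, `lam`, and `r` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 64, 512, 16  # embedding dim, SAE dictionary size, active latents (hypothetical sizes)

# Random stand-ins for a trained top-k SAE (assumption: tied initialisation, zero biases).
W_enc = rng.normal(scale=d ** -0.5, size=(m, d))
b_enc = np.zeros(m)
W_dec = W_enc.T.copy()  # shape (d, m)
b_pre = np.zeros(d)


def topk_encode(x):
    """Top-k SAE encoder: ReLU pre-activations, keep only the k largest per row."""
    pre = np.maximum((x - b_pre) @ W_enc.T + b_enc, 0.0)  # (n, m)
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]  # indices of the k largest entries
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return z


def zero_shot_predict(img_emb, txt_emb):
    """CLIP-style zero-shot: argmax cosine similarity to class-text embeddings."""
    a = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    b = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    return (a @ b.T).argmax(axis=-1)


def steer(x, lam=1.0):
    """Assumed VS2-style steering: add the decoded top-k code of each embedding."""
    v = topk_encode(x) @ W_dec.T  # bias-free decoder read-out, (n, d)
    v /= np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8  # unit norm (assumption)
    return x + lam * v


def vs2pp_steer(x, pool, txt_emb, r=20, lam=1.0):
    """Assumed VS2++-style steering for one query x of shape (d,)."""
    pool_n = pool / np.linalg.norm(pool, axis=-1, keepdims=True)
    sims = pool_n @ (x / np.linalg.norm(x))
    nbrs = pool[np.argsort(-sims)[:r]]  # r most similar unlabeled images
    pseudo = zero_shot_predict(nbrs, txt_emb)  # pseudo-labels for the neighbors
    c = zero_shot_predict(x[None, :], txt_emb)[0]  # provisional label of the query
    z = topk_encode(nbrs)
    pos, neg = z[pseudo == c], z[pseudo != c]
    mu_pos = pos.mean(axis=0) if len(pos) else np.zeros(m)
    mu_neg = neg.mean(axis=0) if len(neg) else np.zeros(m)
    dz = np.maximum(mu_pos - mu_neg, 0.0)  # keep only features enriched in positives
    v = dz @ W_dec.T
    v /= np.linalg.norm(v) + 1e-8
    return x + lam * v


# Usage with random stand-in embeddings (shapes only; predictions are meaningless here).
x = rng.normal(size=(8, d))  # "image embeddings"
txt = rng.normal(size=(10, d))  # "class-text embeddings"
pool = rng.normal(size=(200, d))  # "unlabeled retrieval pool"
preds = zero_shot_predict(steer(x, lam=0.5), txt)
x0_steered = vs2pp_steer(x[0], pool, txt, r=20, lam=0.5)
```

Keeping only the positive part of the contrastive difference means this VS2++ sketch can only amplify features that are more active among same-label neighbors, which mirrors the "selectively amplify" description. Real CLIP embeddings and a trained SAE are needed before the predictions mean anything.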

In practice

Use VS2 for zero-shot classification without extra data.
Apply VS2++ with unlabeled image data for higher gains.
Consider PASS, which adds a prototype-alignment loss to SAE training, for modest but consistent improvements.
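The PASS objective can be sketched as a top-$k$ SAE reconstruction loss plus a prototype-alignment term. The exact form of that term is not given here, so this NumPy sketch assumes class labels (or pseudo-labels) are available for the SAE's training embeddings and penalizes the squared distance between each sparse code and its class's mean code within the batch, weighted by a coefficient `beta`. It only evaluates the loss; an autodiff framework would supply gradients for actual training. The names `pass_style_loss` and `beta` are hypothetical.

```python
import numpy as np


def topk_encode(x, W_enc, b_enc, b_pre, k):
    """Top-k SAE encoder: ReLU pre-activations, keep the k largest per row."""
    pre = np.maximum((x - b_pre) @ W_enc.T + b_enc, 0.0)
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return z


def pass_style_loss(x, labels, W_enc, b_enc, W_dec, b_pre, k, beta=0.1):
    """Hypothetical PASS-style objective: reconstruction + prototype alignment.

    Returns (total, recon, align). The alignment term pulls each sparse code toward
    the mean code (prototype) of its class within the batch; this form is an assumption.
    """
    z = topk_encode(x, W_enc, b_enc, b_pre, k)
    x_hat = z @ W_dec.T + b_pre  # SAE reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    align = 0.0
    for c in np.unique(labels):
        zc = z[labels == c]
        align += np.sum((zc - zc.mean(axis=0)) ** 2)  # squared distance to class prototype
    align /= len(x)
    return recon + beta * align, recon, align


# Usage with random stand-in data (shapes only; not a trained SAE).
rng = np.random.default_rng(0)
d, m, k = 64, 512, 16
W_enc = rng.normal(scale=d ** -0.5, size=(m, d))
W_dec = W_enc.T.copy()
b_enc, b_pre = np.zeros(m), np.zeros(d)
x = rng.normal(size=(32, d))
labels = rng.integers(0, 4, size=32)
total, recon, align = pass_style_loss(x, labels, W_enc, b_enc, W_dec, b_pre, k)
```

When every sample is its own class the alignment term vanishes, so `beta` only matters when several embeddings share a label and can be pulled toward a common prototype.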

Topics

Visual Sparse Steering (VS2)
Zero-shot Image Classification
Sparse Autoencoders
Steering Vectors
Vision Foundation Models

Code references

EleutherAI/sparsify

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.