Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

2026-04-23 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Ramen is a novel framework designed for robust test-time adaptation of vision-language models (VLMs) like CLIP, specifically addressing performance degradation under mixed-domain distribution shifts. Unlike prior methods that assume consistent test domains, Ramen actively selects relevant samples from previously seen data for each incoming test sample. This selection is based on two criteria: domain consistency, to focus adaptation on similar domains, and prediction balance, to counteract adaptation bias from skewed predictions. To enhance efficiency, Ramen utilizes an embedding-gradient cache, storing embeddings and sample-level gradients of past images. This cache enables retrieval of relevant samples and aggregation of gradients for model updates without requiring additional forward or backward passes. Theoretical analysis supports its effectiveness, and experiments on various image corruption and domain-shift benchmarks confirm Ramen's strong and consistent performance in complex mixed-domain scenarios.

Key takeaway

For research scientists developing or deploying vision-language models in real-world applications, Ramen offers a practical solution to maintain performance under diverse and unpredictable data shifts. You should consider integrating active sample selection and gradient caching into your test-time adaptation strategies to ensure robust model behavior across mixed domains, reducing the need for costly retraining or manual domain identification.

Key insights

Ramen robustly adapts vision-language models to mixed-domain shifts using active sample selection and an embedding-gradient cache.

Principles

Adaptation benefits from domain-consistent samples.
Prediction balance mitigates adaptation bias.
Caching gradients improves adaptation efficiency.

Method

Ramen retrieves customized batches from past data using domain consistency and prediction balance criteria, then aggregates cached embedding-gradients for model updates.

In practice

Apply active sample selection for mixed domains.
Implement embedding-gradient caching for efficiency.

Topics

Test-Time Adaptation
Vision-Language Models
Ramen Framework
Active Sample Selection
Embedding-Gradient Cache

Code references

baowenxuan/Ramen

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.