TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

TEVI is a framework that improves vision-language alignment in models like CLIP by using captions to refine image embeddings. It targets information imbalance: an image typically contains more information than its associated caption describes, which leads to suboptimal alignment. TEVI uses sparse autoencoders to disentangle image embeddings, then trains a masking module to selectively reconstruct each embedding based on a given caption, so that caption-described attributes are retained and the rest are discarded. In experiments with synthetic captions, TEVI preserves the relevant attributes. On natural image benchmarks such as MS COCO, Flickr, IIW, and DOCCI, it significantly improves retrieval performance, with stronger gains for richer captions, and it shows enhanced robustness on the RoCOCO benchmark.

Key takeaway

For machine learning engineers optimizing vision-language models, TEVI offers a promising way to improve alignment and retrieval accuracy. Using captions to selectively refine image embeddings mitigates the information imbalance inherent in many image-caption datasets. Consider integrating TEVI's sparse autoencoder and masking module, especially where richer captions are available, to improve performance and robustness in VLM applications.

Key insights

TEVI refines image embeddings using captions to improve vision-language alignment by addressing information imbalance.

Principles

Image-text information imbalance degrades VLM alignment.
Captions can serve as a signal for relevant image attributes.
Disentangling embeddings aids selective reconstruction.

Method

TEVI uses sparse autoencoders to disentangle image embeddings. A masking module is then trained to selectively reconstruct the embedding based on a given caption, preserving described attributes.
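
The method described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the ReLU encoder, the linear sigmoid mask predictor, and all names (W_enc, W_dec, W_mask, edit_image_embedding) are assumptions made for the example. Random weights stand in for a trained sparse autoencoder and masking module, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D = embedding dimension, K = SAE latent dimension.
D, K = 8, 32

# Random weights stand in for a *trained* sparse autoencoder (assumption).
W_enc = rng.normal(size=(D, K)) / np.sqrt(D)
b_enc = np.zeros(K)
W_dec = rng.normal(size=(K, D)) / np.sqrt(K)
b_dec = np.zeros(D)

# Hypothetical masking module: one linear layer from the caption embedding
# to one logit per SAE latent. The real module would be learned.
W_mask = rng.normal(size=(D, K)) / np.sqrt(D)


def sae_encode(image_emb):
    """Map an image embedding to non-negative SAE latents (ReLU encoder)."""
    return np.maximum(image_emb @ W_enc + b_enc, 0.0)


def sae_decode(latents):
    """Reconstruct an embedding from SAE latents."""
    return latents @ W_dec + b_dec


def caption_mask(text_emb):
    """Predict a soft keep/drop weight in (0, 1) for every SAE latent."""
    logits = text_emb @ W_mask
    return 1.0 / (1.0 + np.exp(-logits))


def edit_image_embedding(image_emb, text_emb):
    """Reconstruct the image embedding from caption-selected latents only."""
    latents = sae_encode(image_emb)
    mask = caption_mask(text_emb)
    edited = sae_decode(latents * mask)
    norm = np.linalg.norm(edited)
    # Re-normalise so the result can be compared by cosine similarity.
    return edited / norm if norm > 0 else edited


image_emb = rng.normal(size=D)
text_emb = rng.normal(size=D)
edited_emb = edit_image_embedding(image_emb, text_emb)
```

The design point the sketch tries to show is that the caption never overwrites the image embedding directly. It only gates which disentangled latents take part in the reconstruction.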

In practice

Enhance CLIP model retrieval performance.
Improve robustness on vision-language benchmarks.
Refine image embeddings for specific attributes.
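
Retrieval performance on benchmarks of this kind is commonly reported as Recall@K. The sketch below shows one way to compute it from a caption-by-image similarity matrix. It assumes caption i is paired with image i; the function name recall_at_k and the toy matrix are illustrative and not taken from the paper.

```python
import numpy as np


def recall_at_k(sim, k):
    """Fraction of captions whose paired image is among the top-k images.

    sim[i, j] is the similarity between caption i and image j.
    Assumes the ground-truth image for caption i has index i.
    """
    sim = np.asarray(sim, dtype=float)
    # Indices of the k most similar images for each caption.
    topk = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    hits = (topk == targets).any(axis=1)
    return float(hits.mean())


# Toy similarity matrix for 3 captions and 3 images (made-up values).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.4],
]
r_at_1 = recall_at_k(toy_sim, 1)
r_at_2 = recall_at_k(toy_sim, 2)
```

In the toy matrix only the first caption ranks its own image first, so Recall@1 is 1/3, while every caption has its image within the top two, so Recall@2 is 1.0.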

Topics

TEVI
Vision-Language Models
CLIP
Image Embeddings
Sparse Autoencoders
Information Imbalance
Retrieval Performance

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.