TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
Summary
TEVI, a framework developed by researchers at the Max Planck Institute for Informatics and Saarland University, addresses the poor alignment between image and text embeddings in vision-language models like CLIP. This misalignment is often attributed to information imbalance: an image carries more information than its caption describes, and the mismatch hurts downstream performance. TEVI uses sparse autoencoders to disentangle image embeddings and trains a masking module that selectively reconstructs each embedding based on a given caption. In a controlled synthetic setup, TEVI preserved the attributes described in the caption while discarding the others. When applied to CLIP, SigLIP, SharedCLIP, and AlignCLIP models trained on the CC12M dataset, TEVI consistently improved retrieval performance across coarse-grained (MS COCO, Flickr) and fine-grained (IIW, DOCCI) benchmarks, with stronger gains on richer captions. It also improved robustness on the RoCOCO benchmark, at an inference overhead of approximately 2.7% more FLOPs for 1,000 image-text pairs.
Key takeaway
For machine learning engineers building vision-language applications, TEVI offers a post-hoc method to improve cross-modal retrieval performance and robustness. Consider integrating it especially in systems that handle rich, long captions or need resilience against linguistic perturbations. Because it improves alignment without retraining the core VLM encoders, it works as a targeted, low-overhead upgrade to existing CLIP-family models.
Key insights
TEVI enhances vision-language alignment by using captions to selectively edit image embeddings via sparse autoencoders.
Principles
- Information imbalance between images and captions contributes to the modality gap in VLMs.
- Text captions can guide image embedding refinement.
- Disentangled representations improve fine-grained control.
Method
TEVI trains a masking MLP that takes the text embedding and selects which sparse autoencoder latents of the image to keep. The selected latents are decoded back into an image embedding, and training is driven by an InfoNCE loss.
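The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions, the random weights standing in for a pretrained sparse autoencoder, the ReLU encoder, the soft sigmoid gate, and the symmetric CLIP-style form of InfoNCE are all choices made here for illustration, and every function name is hypothetical. Only the forward pass and the loss are shown; the gradient step that would train the masking MLP is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): embedding dim, SAE latents, MLP hidden width, batch size.
d, k, h, n = 16, 64, 32, 8

# Sparse autoencoder weights. Random values stand in for a pretrained SAE.
W_enc, b_enc = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)
W_dec, b_dec = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)

# Masking MLP weights: text embedding -> one gate per SAE latent.
W1, b1 = rng.normal(scale=0.1, size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(scale=0.1, size=(h, k)), np.zeros(k)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sae_encode(img):
    # ReLU encoder yields non-negative latents (assumed SAE form).
    return np.maximum(img @ W_enc + b_enc, 0.0)


def text_mask(txt):
    # Soft gate in (0, 1) per latent; a hard or top-k mask would also fit the description.
    hidden = np.maximum(txt @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))


def edit_image(img, txt):
    # Keep only the latents the caption selects, then decode back to embedding space.
    z_masked = sae_encode(img) * text_mask(txt)
    return l2_normalize(z_masked @ W_dec + b_dec)


def info_nce(img_edit, txt, temperature=0.07):
    # Matching pairs sit on the diagonal of the similarity matrix.
    logits = img_edit @ txt.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average of image-to-text and text-to-image terms (assumed symmetric form).
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


# Random unit vectors stand in for frozen CLIP image and text embeddings.
img = l2_normalize(rng.normal(size=(n, d)))
txt = l2_normalize(rng.normal(size=(n, d)))

edited = edit_image(img, txt)
loss = info_nce(edited, txt)
```

In this sketch the image and text encoders never appear: only precomputed embeddings are consumed, which is what makes the approach post-hoc.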
In practice
- Apply TEVI post-hoc to pretrained CLIP models.
- Prioritize TEVI for long-caption retrieval tasks.
- Combine with existing alignment methods like AlignCLIP.
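One practical consequence of text-conditioned editing is that an image no longer has a single fixed embedding at retrieval time: it is re-edited for each candidate caption. The sketch below shows how retrieval scoring could be organized under that assumption. The `edit_fn` argument is a hypothetical stand-in for a trained TEVI-style editor; passing an identity function recovers plain cosine-similarity retrieval, which is what the toy usage does. The assumption that caption `j` belongs to image `j` is also made here for illustration.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def pairwise_scores(images, texts, edit_fn):
    """scores[i, j] = cosine similarity of caption j with image i edited for caption j."""
    scores = np.empty((len(images), len(texts)))
    for j in range(len(texts)):
        # Pair caption j with every image, edit each image for that caption, then score.
        txt = np.broadcast_to(texts[j], images.shape)
        edited = l2_normalize(edit_fn(images, txt))
        scores[:, j] = edited @ texts[j]
    return scores


def recall_at_k(scores, k):
    """Text-to-image recall@k, assuming caption j's ground-truth image is image j."""
    top_k = np.argsort(-scores, axis=0)[:k]  # (k, n_texts): best image indices per caption
    hits = (top_k == np.arange(scores.shape[1])).any(axis=0)
    return hits.mean()


# Toy usage: captions identical to images, identity editor (baseline retrieval).
rng = np.random.default_rng(0)
images = l2_normalize(rng.normal(size=(5, 8)))
texts = images.copy()


def identity_edit(img, txt):
    return img


scores = pairwise_scores(images, texts, identity_edit)
r1 = recall_at_k(scores, 1)
```

Scoring this way costs one edit per image-caption pair rather than one per image, so the editor itself needs to stay cheap relative to the frozen encoders.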
Topics
- Vision-Language Models
- CLIP
- Sparse Autoencoders
- Cross-modal Retrieval
- Embedding Alignment
- RoCOCO Benchmark
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.