TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

TEVI, a framework developed by researchers at the Max Planck Institute for Informatics and Saarland University, addresses the poor alignment between image and text embeddings in vision-language models such as CLIP. This misalignment is often attributed to an information imbalance: an image carries more information than its caption describes, and the mismatch hurts downstream performance. TEVI uses sparse autoencoders to disentangle image embeddings and trains a masking module that selectively reconstructs each embedding conditioned on a given caption. In a controlled synthetic setup, TEVI preserved the attributes described in the caption while discarding the others. Applied to CLIP, SigLIP, SharedCLIP, and AlignCLIP models trained on the CC12M dataset, TEVI consistently improved retrieval performance on coarse-grained (MS COCO, Flickr) and fine-grained (IIW, DOCCI) benchmarks, with stronger gains for richer captions. It also improved robustness on the RoCOCO benchmark. The inference overhead is relatively small: approximately 2.7% more FLOPs for 1000 image-text pairs.

Key takeaway

For machine learning engineers building vision-language applications, TEVI offers a post-hoc method for improving cross-modal retrieval performance and robustness. It is worth considering for systems that handle rich, long captions or that need resilience to linguistic perturbations. Because it improves alignment without retraining the core VLM encoders, it is a targeted and inexpensive upgrade for existing CLIP-family models.

Key insights

TEVI enhances vision-language alignment by using captions to selectively edit image embeddings via sparse autoencoders.

Method

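TEVI trains a masking MLP that takes a text embedding and selects sparse-autoencoder latents of the image embedding. The selected latents are decoded to reconstruct a caption-conditioned image embedding, and the masking module is optimized with an InfoNCE loss.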
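The pipeline described above can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU sparse autoencoder, the sigmoid soft mask, the two-layer MLP, the temperature of 0.07, the symmetric form of the loss, the random weights, and all function and parameter names are hypothetical choices made for the example. It also assumes that every image is edited once per candidate caption, so the similarity matrix is built from caption-specific reconstructions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sae_encode(img_emb, W_enc, b_enc):
    # Assumed ReLU sparse-autoencoder encoder: embedding -> non-negative latents.
    return np.maximum(img_emb @ W_enc + b_enc, 0.0)

def sae_decode(latents, W_dec, b_dec):
    # Linear decoder: latents -> reconstructed embedding.
    return latents @ W_dec + b_dec

def masking_mlp(txt_emb, W1, b1, W2, b2):
    # Hypothetical two-layer MLP producing a soft mask in (0, 1) per SAE latent.
    hidden = np.maximum(txt_emb @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))

def edited_similarity(img_emb, txt_emb, p):
    # Edit every image with every caption's mask, then score against that caption.
    latents = sae_encode(img_emb, p["W_enc"], p["b_enc"])           # (n_img, k)
    mask = masking_mlp(txt_emb, p["W1"], p["b1"], p["W2"], p["b2"])  # (n_txt, k)
    masked = latents[:, None, :] * mask[None, :, :]                  # (n_img, n_txt, k)
    edited = l2_normalize(sae_decode(masked, p["W_dec"], p["b_dec"]))  # (n_img, n_txt, d)
    return np.einsum("ijd,jd->ij", edited, l2_normalize(txt_emb))    # cosine similarities

def log_softmax(logits, axis):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(sim, temperature=0.07):
    # Symmetric InfoNCE: matching image-caption pairs lie on the diagonal.
    logits = sim / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: 4 image-caption pairs, 16-dim embeddings, 64 SAE latents.
rng = np.random.default_rng(0)
d, k, h, n = 16, 64, 32, 4
params = {
    "W_enc": rng.normal(0, 0.1, (d, k)), "b_enc": np.zeros(k),
    "W_dec": rng.normal(0, 0.1, (k, d)), "b_dec": np.zeros(d),
    "W1": rng.normal(0, 0.1, (d, h)), "b1": np.zeros(h),
    "W2": rng.normal(0, 0.1, (h, k)), "b2": np.zeros(k),
}
img_emb = rng.normal(size=(n, d))  # stand-ins for frozen image-encoder outputs
txt_emb = rng.normal(size=(n, d))  # stand-ins for frozen text-encoder outputs

sim = edited_similarity(img_emb, txt_emb, params)
loss = info_nce(sim)
```

In a training loop, the gradient of `loss` would update the masking MLP while the vision and text encoders stay frozen, which is what makes the approach post hoc. Whether the sparse autoencoder weights are also updated at this stage is not stated in this summary, so the sketch leaves that open.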

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.