Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Source: cs.CL updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

EmbedFilter is a linear transformation that refines text embeddings taken from large language models (LLMs), whose zero-shot performance on text embedding benchmarks is otherwise suboptimal. The paper traces this weakness to a specific bias: when raw LLM embeddings are projected onto the vocabulary space, they align with frequent, uninformative tokens, and this bias is encoded in the "edge spectrum" subspace of the unembedding matrix. Filtering out that subspace improves the semantic quality of the embeddings, with gains of up to 14.1% on the MTEB benchmark. The same filtering also reduces dimensionality: embeddings can be shrunk to 1/8 of their original size, which lowers index storage and speeds up retrieval. The method is demonstrated across multiple LLM backbones, including Qwen, Llama, and Mistral.

Key takeaway

Machine learning engineers who use LLMs for text embedding and face weak zero-shot performance or high storage and retrieval costs should consider applying EmbedFilter. The transformation is linear, requires no additional training, significantly improves semantic quality, and enables substantial dimensionality reduction. According to the paper, this lets LLM embeddings work more effectively and efficiently in real-world, resource-constrained applications, outperforming even well-trained baselines from the pre-LLM era.

Key insights

LLM unembedding matrices encode an "edge spectrum" subspace that biases embeddings toward frequent, uninformative tokens; EmbedFilter removes this subspace.
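The bias described here can be inspected by projecting an embedding onto the vocabulary space and listing the tokens it aligns with most strongly. The following is a minimal sketch, not code from the paper: the function name `top_aligned_tokens`, the toy vocabulary, and the assumption that the unembedding matrix `W` has shape (vocab_size, hidden_dim) are all illustrative.

```python
import numpy as np

def top_aligned_tokens(h, W, vocab, k=5):
    """Return the k tokens whose unembedding rows align most with embedding h.

    h:     (hidden_dim,) text embedding
    W:     (vocab_size, hidden_dim) unembedding matrix (shape is an assumption)
    vocab: list of vocab_size token strings
    """
    logits = W @ h                      # projection onto the vocabulary space
    top = np.argsort(-logits)[:k]       # indices of the largest logits first
    return [(vocab[i], float(logits[i])) for i in top]

# Toy data (hypothetical): three tokens with one-hot unembedding rows.
toy_vocab = ["the", ",", "cat"]
toy_W = np.eye(3)
toy_h = np.array([0.9, 0.4, 0.1])       # mostly aligned with frequent tokens
print(top_aligned_tokens(toy_h, toy_W, toy_vocab, k=2))
```

With a real model, `W` would be the output projection weights and `vocab` the tokenizer's token list. If the top entries are dominated by stop words and punctuation, that matches the bias the summary describes.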

Method

EmbedFilter applies a Bulk Spectrum Transformation (Φτ) using the unembedding matrix's right singular vectors, excluding those with the largest and smallest singular values, to refine embeddings.
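The method statement above can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the unembedding matrix `W` is assumed to have shape (vocab_size, hidden_dim); the parameters `drop_top` and `drop_bottom` are hypothetical stand-ins for how many singular directions are cut at each end of the spectrum (the summary does not define τ); and expressing the result in the retained basis followed by L2 normalization is an assumption about how the reduced embeddings would be used for retrieval.

```python
import numpy as np

def bulk_spectrum_basis(W, drop_top, drop_bottom):
    """Right singular vectors of W, minus the edges of the spectrum.

    W: (vocab_size, hidden_dim) unembedding matrix (shape is an assumption).
    drop_top / drop_bottom: hypothetical counts of directions to discard,
    those with the largest and smallest singular values respectively.
    Returns a (hidden_dim, k) matrix with orthonormal columns.
    """
    # np.linalg.svd returns singular values in descending order,
    # so the rows of Vt are ordered from largest to smallest.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    r = Vt.shape[0]
    if drop_top < 0 or drop_bottom < 0 or drop_top + drop_bottom >= r:
        raise ValueError("drop_top + drop_bottom must leave at least one direction")
    return Vt[drop_top:r - drop_bottom].T

def embed_filter(H, basis):
    """Project embeddings H (n, hidden_dim) onto the retained basis.

    The output has k = basis.shape[1] dimensions; rows are L2-normalized
    (an assumption, convenient for cosine-similarity retrieval).
    """
    Z = H @ basis
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.clip(norms, 1e-12, None)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
W_toy = rng.normal(size=(50, 16))
H_toy = rng.normal(size=(4, 16))
B_toy = bulk_spectrum_basis(W_toy, drop_top=2, drop_bottom=6)
print(embed_filter(H_toy, B_toy).shape)
```

Because the basis depends only on the unembedding matrix, it can be computed once per model and reused for every embedding, which is consistent with the summary's claim that no additional training is required. Keeping only k retained directions is also what yields the smaller vectors.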

Best for: AI Engineer, Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.