Multimodal Embedding & Reranker Models with Sentence Transformers
Summary
The Sentence Transformers Python library, with its v5.4 update released on April 9, 2026, now supports multimodal embedding and reranker models that can encode and compare text, images, audio, and video. Multimodal embedding models map inputs from different modalities into a shared vector space so they can be compared across modalities, while multimodal reranker models score the relevance of mixed-modality pairs. This makes applications such as visual document retrieval, cross-modal search, and multimodal Retrieval-Augmented Generation (RAG) pipelines possible. Newly supported models include Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B; the 2B variants need roughly 8 GB of GPU VRAM and the 8B variants roughly 20 GB. The API for loading and encoding is unchanged, with `encode_query()` and `encode_document()` available for retrieval tasks and `rank()` for reranking mixed-modality documents.
Key takeaway
For AI Engineers building RAG or semantic search systems, the Sentence Transformers v5.4 update adds multimodal embedding and reranking to the same API already used for text. Explore the new `Qwen/Qwen3-VL-Embedding-2B` and `Qwen/Qwen3-VL-Reranker-2B` models for cross-modal retrieval and relevance scoring, particularly in applications that mix images and text. Be mindful of the GPU VRAM requirements, especially for the larger 8B-parameter models, and consider a retrieve-and-rerank pipeline: use the embedding model to retrieve candidates quickly, then the reranker to order the shortlist more accurately.
Key insights
Sentence Transformers v5.4 introduces multimodal embedding and reranking for text, images, audio, and video.
Principles
- Multimodal embeddings map diverse inputs to a shared vector space.
- Cross-modal similarities are typically lower than within-modal ones.
- Rerankers offer higher quality but are slower than embedding models.
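To make the first two principles concrete, here is a minimal sketch of cosine similarity in a shared vector space. The 3-dimensional vectors are made-up toy values, not output from any real model; they are chosen only to show that a cross-modal match can score lower than a within-modal match and still rank above a cross-modal mismatch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (illustrative values, not from a real model).
text_cat = [0.9, 0.1, 0.3]       # text: "a cat"
text_kitten = [0.85, 0.15, 0.35]  # text: "a kitten"
image_cat = [0.6, 0.5, 0.4]      # image of a cat
image_car = [0.1, 0.9, 0.2]      # image of a car

within_modal = cosine_similarity(text_cat, text_kitten)
cross_modal_match = cosine_similarity(text_cat, image_cat)
cross_modal_mismatch = cosine_similarity(text_cat, image_car)

print(within_modal, cross_modal_match, cross_modal_mismatch)
```

One practical reading of this: when comparing across modalities, the relative order of scores is more informative than a fixed absolute threshold tuned on text-to-text similarities.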
Method
Use `model.encode()` for multimodal embeddings, `model.similarity()` for cross-modal comparisons, and `model.rank()` or `model.predict()` for multimodal reranking. The two model types are often combined: an embedding model retrieves candidates, and a reranker refines their order.
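The retrieve-then-rerank pattern can be sketched without loading any model. In the example below, `retrieve_then_rerank`, `shared_words`, and `jaccard` are hypothetical names introduced for illustration: the word-overlap scorers are cheap stand-ins for an embedding model's similarity score and a reranker's relevance score, so the control flow is visible on its own.

```python
def retrieve_then_rerank(query, documents, retrieve_score, rerank_score, top_k=2):
    """Two-stage search: cheap scoring over everything, costly scoring over a shortlist."""
    # Stage 1: score every document with the fast scorer and keep the top_k.
    shortlist = sorted(
        documents, key=lambda doc: retrieve_score(query, doc), reverse=True
    )[:top_k]
    # Stage 2: reorder only the shortlist with the slower, more precise scorer.
    return sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)


def shared_words(query, doc):
    """Stand-in for embedding similarity: number of words in common."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Stand-in for a reranker: overlap normalised by the combined vocabulary."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


documents = [
    "red bus near sports bar and car wash",
    "red car",
    "bowl of fruit",
]
results = retrieve_then_rerank("red sports car", documents, shared_words, jaccard)
print(results)
```

With a real setup, stage 1 would typically use embeddings from `encode_query()` and `encode_document()` compared via `similarity()`, and stage 2 would call the reranker's `rank()` on the shortlist only, which keeps the slower model off the full corpus.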
In practice
- Install `sentence-transformers[image,video]` to pull in the dependencies for image and video inputs.
- Use `encode_query()` and `encode_document()` for retrieval tasks.
- Configure `processor_kwargs` for image resolution and `model_kwargs` for precision.
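The points above might be wired together as in the following sketch. Several details are assumptions rather than confirmed API: that `processor_kwargs` is passed to the `SentenceTransformer` constructor alongside `model_kwargs`, that the processor accepts a `max_pixels` key to cap image resolution, and that `torch_dtype` is the key used to select precision. The helper names `build_load_kwargs`, `load_embedder`, and `score_documents` are hypothetical.

```python
# Install first (shell): pip install "sentence-transformers[image,video]"


def build_load_kwargs(max_pixels=600 * 600, dtype="bfloat16"):
    """Collect loading options. The keys inside each dict are assumptions."""
    return {
        # Assumed key: lower precision to reduce VRAM use.
        "model_kwargs": {"torch_dtype": dtype},
        # Assumed key: cap on image resolution (total pixels) before encoding.
        "processor_kwargs": {"max_pixels": max_pixels},
    }


def load_embedder(model_name="Qwen/Qwen3-VL-Embedding-2B", **overrides):
    """Load the embedding model; downloads weights, so the import is deferred."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, **build_load_kwargs(**overrides))


def score_documents(model, query, documents):
    """Embed a query and documents with the retrieval-specific methods, then compare."""
    query_embedding = model.encode_query(query)
    document_embeddings = model.encode_document(documents)
    return model.similarity(query_embedding, document_embeddings)


load_kwargs = build_load_kwargs()
print(load_kwargs)
```

Calling `load_embedder()` requires a GPU with enough VRAM for the chosen model, so it is left uncalled here; `score_documents(load_embedder(), query, documents)` would be the typical entry point.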
Topics
- Sentence Transformers
- Multimodal Embeddings
- Multimodal Rerankers
- Retrieval-Augmented Generation
- Cross-Modal Search
Best for: AI Engineer, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.