Multimodal Embedding & Reranker Models with Sentence Transformers

2026-04-10 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

The Sentence Transformers Python library added support for multimodal embedding and reranker models in v5.4, released on April 9, 2026, so it can now encode and compare text, images, audio, and video. Multimodal embedding models map inputs from different modalities into a shared vector space for cross-modal similarity comparisons, while multimodal reranker models score the relevance of mixed-modality pairs. This enables applications such as visual document retrieval, cross-modal search, and multimodal Retrieval-Augmented Generation (RAG) pipelines. Supported models include Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B; the 2B variants need about 8 GB of GPU VRAM and the 8B variants about 20 GB. The API for loading and encoding is unchanged, with `encode_query()` and `encode_document()` available for retrieval tasks and `rank()` for reranking mixed-modality documents.

Key takeaway

For AI engineers building RAG or semantic search systems, Sentence Transformers v5.4 adds multimodal embedding and reranking to the existing text workflow. Try the `Qwen/Qwen3-VL-Embedding-2B` and `Qwen/Qwen3-VL-Reranker-2B` models for cross-modal retrieval and relevance scoring, particularly in applications that mix images and text. Check GPU VRAM requirements before choosing a model size, especially for the 8B variants, and consider a retrieve-and-rerank pipeline: a fast embedding model selects candidates, then a slower, higher-quality reranker orders them.

Key insights

Sentence Transformers v5.4 introduces multimodal embedding and reranking for text, images, audio, and video.

Principles

Multimodal embeddings map diverse inputs to a shared vector space.
Cross-modal similarities are typically lower than within-modal ones.
Rerankers offer higher quality but are slower than embedding models.
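
The second principle has a practical consequence: a score threshold tuned on text-to-text pairs can reject a correct cross-modal match, while ranking candidates against each other still works. The sketch below illustrates this with cosine similarity over made-up 3-dimensional vectors; the vectors are illustrative stand-ins, not output from any real model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings in a shared space (illustrative only).
text_query = [0.9, 0.1, 0.4]
text_doc = [0.8, 0.2, 0.5]       # same modality, same topic
image_match = [0.5, 0.6, 0.3]    # different modality, same topic
image_other = [0.1, 0.9, -0.2]   # different modality, different topic

within = cosine(text_query, text_doc)
cross_match = cosine(text_query, image_match)
cross_other = cosine(text_query, image_other)

# The matching image scores below the text document but well above the
# unrelated image, so sorting by score still surfaces it.
ranked_images = sorted(
    [("image_match", cross_match), ("image_other", cross_other)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked_images[0][0])
```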

Method

Use `model.encode()` to produce multimodal embeddings and `model.similarity()` to compare them across modalities. For reranking, use `model.rank()` or `model.predict()`. The two are often combined: the embedding model retrieves candidates and the reranker refines their order.
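
The two-stage pattern can be sketched without the library. In the minimal example below, the hypothetical `retrieve` plays the role of encoding plus similarity search, and the hypothetical `toy_score_pair` stands in for a reranker's pairwise scoring. The vectors, the word-overlap scorer, and all function names are assumptions made for illustration, not the library's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: cheap similarity over all documents; return the top_k indices."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:top_k]


def rerank(query, docs, candidate_ids, score_pair):
    """Stage 2: slower pairwise scoring, applied only to the retrieved candidates."""
    return sorted(candidate_ids, key=lambda i: score_pair(query, docs[i]), reverse=True)


def toy_score_pair(query, doc):
    # Hypothetical stand-in for a reranker: counts words shared by query and document.
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().replace(",", "").split())
    return len(query_words & doc_words)


# Text descriptions stand in for mixed-modality documents; vectors are made up.
docs = [
    "chart of quarterly revenue",
    "photo of a cat",
    "revenue table, fiscal 2024",
    "audio clip of rain",
]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [0.1, 0.8]]
query = "revenue in 2024"
query_vec = [1.0, 0.2]

candidates = retrieve(query_vec, doc_vecs, top_k=2)
final_order = rerank(query, docs, candidates, toy_score_pair)
print([docs[i] for i in final_order])
```

The reranker only sees the `top_k` candidates, which is what keeps its higher per-pair cost affordable.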

In practice

Install the extras for the modalities you need, e.g. `sentence-transformers[image,video]` for image and video inputs.
Use `encode_query()` and `encode_document()` for retrieval tasks.
Configure `processor_kwargs` for image resolution and `model_kwargs` for precision.
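
As a hedged sketch of the last point, the helper below assembles the two option dictionaries and passes them to the `SentenceTransformer` constructor. The argument names `model_kwargs` and `processor_kwargs` come from the text above; the keys inside them (`torch_dtype`, `max_pixels`), the 600-pixel default, and the helper names are assumptions that may differ by model and library version, so check the model card before relying on them.

```python
def build_load_kwargs(max_image_side=600, dtype="bfloat16"):
    """Collect constructor options; the keys inside each dict are assumptions."""
    return {
        # Passed to the underlying model; lower precision reduces VRAM use.
        "model_kwargs": {"torch_dtype": dtype},
        # Passed to the input processor; caps the pixel count per image (assumed key).
        "processor_kwargs": {"max_pixels": max_image_side * max_image_side},
    }


def load_embedder(name="Qwen/Qwen3-VL-Embedding-2B"):
    """Load the model with the options above (downloads weights; a GPU is advisable)."""
    # Imported lazily so build_load_kwargs can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name, **build_load_kwargs())


kwargs = build_load_kwargs()
print(kwargs)
```

Lowering `max_image_side` trades image detail for memory and speed; `load_embedder()` is defined but not called here because it downloads the model weights.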

Topics

Sentence Transformers
Multimodal Embeddings
Multimodal Rerankers
Retrieval-Augmented Generation
Cross-Modal Search

Best for: AI Engineer, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.