Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

2026-04-16 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, long

Summary

This article explains how to train and finetune multimodal embedding and reranker models with the Sentence Transformers Python library. Its worked example finetunes `Qwen/Qwen3-VL-Embedding-2B` for Visual Document Retrieval (VDR), the task of retrieving the relevant document pages (as images) for a text query. The finetuned model, `tomaarsen/Qwen3-VL-Embedding-2B-vdr`, reaches an NDCG@10 of 0.947, up from the base model's 0.888, and outperforms every other VDR model tested, including models up to 4x its 2.1B parameter size. The article also uses `MatryoshkaLoss` so that embeddings can be truncated effectively: the finetuned model stays near peak performance even at 512 dimensions. Training multimodal reranker models is covered briefly.
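For readers unfamiliar with the metric quoted above, here is a minimal sketch of how NDCG@10 can be computed for one query. It is not taken from the article: the function names are illustrative, gains are linear (equivalent to the exponential form when relevance is binary), and the input list is assumed to hold the relevance label of every candidate in ranked order, so the ideal ordering can be derived from the same list.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (rank 1 first)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0, 0])
```

A reported NDCG@10 such as 0.947 would typically be the mean of this per-query score over the evaluation queries.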

Key takeaway

For AI engineers building retrieval systems, finetuning a multimodal embedding model on domain-specific data can yield substantial gains, even with a smaller model. Consider combining `CachedMultipleNegativesRankingLoss` (with `mini_batch_size=1`) and `MatryoshkaLoss` to get both retrieval quality and deployment flexibility: embeddings can then be truncated without significant quality degradation.

Key insights

Finetuning multimodal embedding models on domain-specific data significantly boosts performance over general-purpose models.

Principles

Domain-specific finetuning improves model performance.
Larger batch sizes strengthen the training signal for ranking losses that use in-batch negatives, because each query is contrasted against more negatives.
Matryoshka training enables flexible embedding dimensionality.
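The last two principles can be illustrated with a small, dependency-free sketch. It is a conceptual illustration, not the library's implementation: the ranking loss is a cross-entropy over scaled cosine similarities in which `documents[i]` is the positive for `queries[i]` and every other document in the batch is a negative, and the Matryoshka variant sums that loss over truncated prefixes of the embeddings. The `scale` value and the `dims` are assumptions chosen for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def in_batch_ranking_loss(queries, documents, scale=20.0):
    """Mean cross-entropy where documents[i] is the positive for queries[i]
    and the other len(documents) - 1 documents act as negatives."""
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, dj)) for dj in d]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


def matryoshka_loss(queries, documents, dims=(4, 2)):
    """Sum the ranking loss over the first `dim` values of each embedding."""
    return sum(
        in_batch_ranking_loss([v[:dim] for v in queries], [v[:dim] for v in documents])
        for dim in dims
    )


queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = matryoshka_loss(queries, queries)          # positives match: loss near 0
swapped = matryoshka_loss(queries, queries[::-1])    # positives mismatched: large loss
```

With a batch of N pairs each query sees N - 1 negatives, which is why a larger batch gives a harder and more informative ranking problem. Because the loss is also applied to the truncated prefixes, the leading dimensions are pushed to be useful on their own.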

Method

Finetune existing multimodal embedding models or VLMs using `SentenceTransformerTrainer`, `CachedMultipleNegativesRankingLoss` (with `mini_batch_size=1`), and `MatryoshkaLoss` for dimension flexibility.
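The method above might be wired together roughly as follows. This is a hedged sketch rather than the article's actual training script: `build_trainer`, the output directory, the batch size, and the Matryoshka dimensions are illustrative assumptions, and `train_dataset` is assumed to be a dataset of (text query, positive page image) pairs in whatever column layout the model expects. In practice the Matryoshka dimensions would normally include the model's full embedding size.

```python
def build_trainer(
    train_dataset,
    output_dir="models/qwen3-vl-embedding-2b-vdr",  # hypothetical path
    matryoshka_dims=(1024, 512, 256),               # illustrative values only
):
    """Assemble a SentenceTransformerTrainer with the loss setup named in the summary."""
    # Imported inside the function so that merely defining it does not
    # require sentence-transformers to be installed.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )
    from sentence_transformers.training_args import BatchSamplers

    model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

    # Gradient caching processes the batch in mini-batches of 1 to limit
    # memory use while still ranking against the full batch of negatives.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=list(matryoshka_dims))

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=64,  # assumption; tune for your hardware
        bf16=True,                       # requires a GPU with bfloat16 support
        batch_sampler=BatchSamplers.NO_DUPLICATES,
    )
    return SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    )
```

Calling `build_trainer(my_dataset).train()` would start finetuning; `my_dataset` here is a placeholder for your own domain data.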

In practice

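Use `bf16=True` on hardware that supports bfloat16; its wider exponent range makes mixed-precision training more numerically stable than fp16.
Set `batch_sampler=BatchSamplers.NO_DUPLICATES` for ranking losses that use in-batch negatives, so a duplicate sample in the batch is not treated as a false negative.
Truncate embeddings (for example to 512 dimensions) for faster search with minimal quality loss.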
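The truncation tip can be sketched without any library: keep the first `dim` values of each embedding, rescale to unit length, and rank by cosine similarity. The function names and the toy vectors are illustrative assumptions. Truncation of this kind is only expected to preserve quality when the model was trained with a Matryoshka-style objective.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def rank_by_cosine(query, documents, dim):
    """Return document indices sorted by cosine similarity on truncated embeddings."""
    q = truncate_and_normalize(query, dim)
    scores = [
        sum(a * b for a, b in zip(q, truncate_and_normalize(doc, dim)))
        for doc in documents
    ]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0, 0.5, 0.5]
documents = [[0.0, 1.0, 0.5, 0.5], [1.0, 0.1, 0.0, 0.0]]
ranking = rank_by_cosine(query, documents, dim=2)
```

Shorter vectors reduce both index size and the cost of each similarity computation, which is where the search speed-up comes from.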

Topics

Sentence Transformers
Multimodal Embeddings
Visual Document Retrieval
Model Finetuning
Matryoshka Loss

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.