Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Summary
This article, published on April 16, 2026, details how to train and finetune multimodal embedding and reranker models using the Sentence Transformers Python library. It focuses on a practical example of finetuning the `Qwen/Qwen3-VL-Embedding-2B` model for Visual Document Retrieval (VDR), a task involving retrieving relevant document pages (images) for a given text query. The finetuned model, `tomaarsen/Qwen3-VL-Embedding-2B-vdr`, achieved an NDCG@10 of 0.947, significantly outperforming the base model's 0.888 and surpassing all other tested VDR models, including those up to 4x its 2.1B parameter size. The article also demonstrates the use of `MatryoshkaLoss` to enable effective embedding truncation, showing the finetuned model maintains near-peak performance even at 512 dimensions, and briefly covers training multimodal reranker models.
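To make the NDCG@10 figures above concrete, here is a minimal pure-Python sketch of the metric for a single query. It is not the article's evaluation code. The function names are illustrative, and it assumes every relevant document for the query appears somewhere in the ranked list, so the ideal ordering can be derived by sorting that list.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query, given relevance labels in ranked order.

    Assumption: all relevant documents are present in `ranked_relevances`,
    so sorting it gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, retrieved at rank 1 versus rank 2.
best = ndcg_at_k([1, 0, 0])
second = ndcg_at_k([0, 1, 0])
```

A benchmark score such as 0.947 is the average of this per-query value over all evaluation queries.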
Key takeaway
For AI engineers building retrieval systems, finetuning a multimodal embedding model on your own domain data can yield substantial performance gains, even with a smaller model. Consider combining `CachedMultipleNegativesRankingLoss` (with `mini_batch_size=1`) and `MatryoshkaLoss`: together they target both retrieval quality and deployment flexibility, since the resulting embeddings can be truncated without significant quality degradation.
Key insights
Finetuning multimodal embedding models on domain-specific data significantly boosts performance over general-purpose models.
Principles
- Domain-specific finetuning improves model performance.
- Larger batch sizes supply more in-batch negatives per query, which strengthens the training signal for ranking losses.
- Matryoshka training enables flexible embedding dimensionality.
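The batch-size principle can be illustrated with a small pure-Python sketch of an in-batch-negatives ranking loss. This is a simplified stand-in for the idea behind losses such as `CachedMultipleNegativesRankingLoss`, not the library's implementation. The function name and the `scale` value of 20 are assumptions made for illustration.

```python
import math


def in_batch_ranking_loss(sim, scale=20.0):
    """Mean cross-entropy over a square similarity matrix.

    sim[i][j] is the similarity between query i and document j.
    The diagonal entry sim[i][i] is the positive pair; every other
    entry in row i acts as a negative for query i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


# An untrained model that scores every pair identically: the loss equals
# log(batch_size), so a larger batch poses a harder ranking problem.
loss_batch_2 = in_batch_ranking_loss([[0.5] * 2 for _ in range(2)])
loss_batch_4 = in_batch_ranking_loss([[0.5] * 4 for _ in range(4)])
```

With a batch of size N, each query is contrasted against N - 1 negatives, which is why larger batches make each training step more informative.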
Method
Finetune existing multimodal embedding models or VLMs using `SentenceTransformerTrainer`, `CachedMultipleNegativesRankingLoss` (with `mini_batch_size=1`), and `MatryoshkaLoss` for dimension flexibility.
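The method above might be wired together roughly as follows. This is a hedged sketch, not the article's training script: `build_trainer`, `matryoshka_dims`, `SMALLER_DIMS`, the output directory, and all hyperparameter values are illustrative assumptions, and `train_dataset` is a caller-supplied dataset assumed to pair text queries with positive page images.

```python
# Candidate truncation sizes, largest first. Illustrative values only.
SMALLER_DIMS = [1024, 512, 256, 128]


def matryoshka_dims(full_dim, smaller=SMALLER_DIMS):
    """Return the full dimension followed by every smaller candidate."""
    return [full_dim] + [d for d in smaller if d < full_dim]


def build_trainer(train_dataset, model_name="Qwen/Qwen3-VL-Embedding-2B",
                  output_dir="models/vdr-finetuned"):
    """Assemble a trainer; call .train() on the result to start finetuning."""
    # Imported here so matryoshka_dims() works without the library installed.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )
    from sentence_transformers.training_args import BatchSamplers

    model = SentenceTransformer(model_name)

    # Ranking loss with gradient caching; mini_batch_size=1 as in the summary.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
    # Apply the same loss at several truncated embedding sizes.
    dims = matryoshka_dims(model.get_sentence_embedding_dimension())
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=dims)

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,                # assumption
        per_device_train_batch_size=64,    # assumption
        bf16=True,                         # requires bf16-capable hardware
        batch_sampler=BatchSamplers.NO_DUPLICATES,
    )
    return SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    )
```

Usage would be `build_trainer(train_dataset).train()`. The `mini_batch_size` setting bounds memory per forward pass while the per-device batch size controls how many in-batch negatives each query sees.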
In practice
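- Use `bf16=True` on hardware that supports it; bf16 is more numerically stable than fp16.
- Set `batch_sampler=BatchSamplers.NO_DUPLICATES` for ranking losses, so duplicate samples in a batch are not treated as false negatives.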
- Truncate embeddings for faster search with minimal quality loss.
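Embedding truncation, the last point above, amounts to keeping the leading dimensions and renormalizing. The sketch below shows this in plain Python with made-up vectors. The function names are illustrative, and the numbers are not model outputs.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid division by zero for an all-zero prefix
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional "embeddings" truncated to 2 dimensions.
query = truncate_and_normalize([3.0, 4.0, 12.0, 0.5], 2)
page = truncate_and_normalize([6.0, 8.0, -1.0, 2.0], 2)
score = cosine_similarity(query, page)
```

In Sentence Transformers, the `truncate_dim` argument of `SentenceTransformer` applies the truncation step when the model is loaded. Shorter vectors reduce index size and speed up similarity search, and a model trained with `MatryoshkaLoss` is intended to keep most of its quality at those sizes.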
Topics
- Sentence Transformers
- Multimodal Embeddings
- Visual Document Retrieval
- Model Finetuning
- Matryoshka Loss
Best for: Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.