Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model
Summary
NVIDIA has released the Nemotron ColEmbed V2 family of late-interaction embedding models for high-accuracy multimodal retrieval, available in 3B, 4B, and 8B sizes. The models achieve state-of-the-art results on the ViDoRe V1, V2, and V3 benchmarks, and nemotron-colembed-vl-8b-v2 ranks #1 on ViDoRe V3 with an average NDCG@10 of 63.42. They extend the ColBERT-style late-interaction mechanism to a multimodal setting, so each query token embedding can be compared with each document token embedding, whether the document tokens are textual or visual. Keeping one embedding per token increases storage requirements, but it improves accuracy by capturing fine-grained semantic relationships. The models are built on existing foundation models (google/siglip2-giant-opt-patch16-384, meta-llama/Llama-3.2-3B, Qwen3-VL-8B-Instruct, and Qwen3-VL-4B-Instruct) and were trained as bi-encoders using contrastive learning with hard negative mining.
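For readers unfamiliar with the metric behind the leaderboard number, here is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is an illustration under assumptions, not the ViDoRe evaluation code: the function names are hypothetical, the gain is linear (some implementations use an exponential gain instead), and the ideal ranking is derived from the same list of judged documents. Benchmarks typically average this value over all queries and report it multiplied by 100.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    Assumption: ranked_relevances holds the relevance label of every judged
    document, in the order the retriever returned them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy query: the single relevant page was returned at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0])
```

A perfect ranking yields 1.0; pushing relevant pages further down the list lowers the score.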
Key takeaway
For AI architects and research scientists building multimodal retrieval systems where accuracy is paramount, Nemotron ColEmbed V2 is a strong candidate. Its late-interaction architecture and top ranking on ViDoRe V3 suggest it can improve the precision of RAG systems and cross-modal search applications. Consider the 8B model when accuracy matters most, keeping in mind that its multi-vector embeddings require more storage than single-vector alternatives.
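To make the storage trade-off concrete, the back-of-the-envelope calculation below compares a single-vector index with a multi-vector one. Every number in it is a hypothetical assumption chosen for illustration (corpus size, vectors per page, embedding dimensions, 2-byte values); none comes from the Nemotron ColEmbed V2 release, so substitute the figures for your own model and corpus.

```python
def index_size_bytes(num_pages: int, vectors_per_page: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Hypothetical corpus and embedding shapes (assumptions, not published figures).
NUM_PAGES = 1_000_000

# Single-vector model: one 1024-dimensional embedding per page.
single_vector = index_size_bytes(NUM_PAGES, vectors_per_page=1, dim=1024)

# Late-interaction model: 512 token embeddings of 128 dimensions per page.
multi_vector = index_size_bytes(NUM_PAGES, vectors_per_page=512, dim=128)

ratio = multi_vector // single_vector

print(f"single-vector: {single_vector / 1e9:.1f} GB")
print(f"multi-vector:  {multi_vector / 1e9:.1f} GB ({ratio}x larger)")
```

Under these assumed shapes the multi-vector index is 64 times larger, which is why the storage budget deserves a look before committing to a late-interaction model.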
Key insights
The Nemotron ColEmbed V2 models reach state-of-the-art accuracy on the ViDoRe multimodal retrieval benchmarks using late-interaction embeddings.
Principles
- Late-interaction embeddings improve multimodal retrieval accuracy.
- Bi-directional self-attention enhances representation learning.
- Contrastive learning with hard negative mining boosts performance.
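The last principle can be illustrated with a small sketch of contrastive training with hard negatives. This is a generic InfoNCE-style loss under stated assumptions, not the training recipe used for these models: the function names, the temperature of 0.05, and the toy similarity scores are all hypothetical, and in practice the scores used for mining would come from an existing retriever.

```python
import math
from typing import Dict, List


def mine_hard_negatives(scores: Dict[str, float], positive_id: str,
                        k: int) -> List[str]:
    """Pick the k highest-scoring documents that are not the positive."""
    candidates = [(doc_id, s) for doc_id, s in scores.items()
                  if doc_id != positive_id]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in candidates[:k]]


def info_nce_loss(pos_score: float, neg_scores: List[float],
                  temperature: float = 0.05) -> float:
    """Cross-entropy of the positive against the negatives (InfoNCE)."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical query-document similarity scores from an existing retriever.
scores = {"pos": 0.9, "a": 0.8, "b": 0.1, "c": 0.7, "d": 0.0}

hard_ids = mine_hard_negatives(scores, positive_id="pos", k=2)
hard_loss = info_nce_loss(scores["pos"], [scores[i] for i in hard_ids])
easy_loss = info_nce_loss(scores["pos"], [scores["b"], scores["d"]])
```

With these toy scores the hard negatives produce a much larger loss than the easy ones, which is the intuition behind mining them: negatives the model already separates easily contribute almost no training signal.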
Method
The models use a bi-encoder architecture with ColBERT-style late interaction. Query and document token embeddings are computed independently. At scoring time, a MaxSim operator finds, for each query token, its highest similarity to any document token, and the relevance score is the sum of these maxima.
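The scoring step can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not NVIDIA's implementation: the function names (`maxsim_score`, `rank_documents`) and the two-dimensional toy embeddings are hypothetical, the embeddings are assumed to be L2-normalized so that a dot product equals cosine similarity, and production systems run the same computation on batched tensors.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not query_tokens or not doc_tokens:
        raise ValueError("query and document need at least one token embedding")
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens: List[Vector],
                   docs: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Score every document independently and sort by descending relevance."""
    scored = [(doc_id, maxsim_score(query_tokens, tokens))
              for doc_id, tokens in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # one exact match per query token
    "page_b": [[0.6, 0.8]],              # a single partially matching token
}
ranking = rank_documents(query, docs)
```

Here `page_a` scores 2.0 because each query token finds an exact match, while `page_b` scores roughly 1.4. Because document embeddings do not depend on the query, they can be computed and stored ahead of time, which is what makes the bi-encoder design practical for retrieval.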
In practice
- Use for multimodal RAG systems requiring high accuracy.
- Apply in multimedia search engines.
- Integrate into conversational AI for rich input understanding.
Topics
- Multimodal Retrieval
- Late Interaction Models
- Nemotron ColEmbed V2
- ViDoRe Benchmark
- RAG Systems
Best for: AI Architect, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.