Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model

2026-02-04 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

NVIDIA has released the Nemotron ColEmbed V2 family of late-interaction embedding models for multimodal retrieval, available in 3B, 4B, and 8B sizes. The models achieve state-of-the-art performance on the ViDoRe V1, V2, and V3 benchmarks, and nemotron-colembed-vl-8b-v2 ranks #1 on ViDoRe V3 with an average NDCG@10 of 63.42. They extend the ColBERT-style late-interaction mechanism to a multimodal setting, so query tokens interact with document tokens at a fine-grained level, whether those tokens are textual or visual. Because each document is stored as one embedding per token rather than as a single vector, storage requirements increase, but the token-level matching captures more detailed semantic relationships and improves accuracy. The models are built on google/siglip2-giant-opt-patch16-384, meta-llama/Llama-3.2-3B, Qwen3-VL-8B-Instruct, and Qwen3-VL-4B-Instruct, and were trained as bi-encoders with contrastive learning and hard negative mining.

Key takeaway

For AI architects and research scientists building multimodal retrieval systems where accuracy is paramount, Nemotron ColEmbed V2 is a strong candidate. Its late-interaction architecture and #1 ranking on ViDoRe V3 suggest it can improve the precision of RAG systems and cross-modal search applications. Consider the 8B model when accuracy matters most, with the understanding that multi-vector embeddings need more storage than single-vector alternatives.
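
To make the storage trade-off concrete, here is a rough back-of-the-envelope sketch. Every number in it (corpus size, vectors per page, embedding dimensions, 2 bytes per value) is a hypothetical placeholder chosen for illustration, not a published figure for any Nemotron ColEmbed V2 model.

```python
def index_size_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# All values below are hypothetical, for illustration only.
NUM_PAGES = 1_000_000        # assumed corpus size
SINGLE_VECTOR_DIM = 2048     # assumed single-vector embedding width
TOKENS_PER_PAGE = 700        # assumed token embeddings stored per page
TOKEN_DIM = 128              # assumed per-token embedding width

single_vector = index_size_bytes(NUM_PAGES, 1, SINGLE_VECTOR_DIM)
multi_vector = index_size_bytes(NUM_PAGES, TOKENS_PER_PAGE, TOKEN_DIM)
ratio = multi_vector / single_vector

print(f"single-vector index: {single_vector / 1e9:.1f} GB")
print(f"multi-vector index:  {multi_vector / 1e9:.1f} GB")
print(f"ratio: {ratio:.2f}x")
```

Under these assumed values the multi-vector index is tens of times larger than the single-vector one. The actual ratio depends on how many token embeddings a model emits per page and on their dimension, so it is worth recomputing with the real figures from the model card before sizing a deployment.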

Key insights

Nemotron ColEmbed V2 models set new accuracy standards for multimodal retrieval using late-interaction embeddings.

Principles

Late-interaction embeddings improve multimodal retrieval accuracy.
Bi-directional self-attention enhances representation learning.
Contrastive learning with hard negative mining boosts performance.
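
As an illustration of the last principle, the sketch below pairs a generic InfoNCE-style contrastive loss with a simple top-k hard negative selection. It is not NVIDIA's training recipe: the function names, the temperature of 0.05, and the toy relevance scores are all assumptions made for this example.

```python
import math


def mine_hard_negatives(scores, positive_id, k):
    """Return the k highest-scoring candidate ids that are not the positive."""
    negatives = [(s, cid) for cid, s in scores.items() if cid != positive_id]
    negatives.sort(reverse=True)
    return [cid for _, cid in negatives[:k]]


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Cross-entropy of the positive against the negatives (InfoNCE form)."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical query-to-page relevance scores; page_1 is the labeled positive.
scores = {"page_1": 0.82, "page_2": 0.79, "page_3": 0.40, "page_4": 0.10}

hard_ids = mine_hard_negatives(scores, "page_1", k=2)
hard_loss = info_nce_loss(scores["page_1"], [scores[i] for i in hard_ids])
easy_loss = info_nce_loss(scores["page_1"], [scores["page_4"]])

print(hard_ids)
print(f"loss with hard negatives: {hard_loss:.4f}")
print(f"loss with an easy negative: {easy_loss:.6f}")
```

The hard negatives score close to the positive, so they produce a much larger loss, and therefore a stronger training signal, than an easy negative that the model already separates well.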

Method

The models use a bi-encoder architecture with ColBERT-style late interaction. Query and document token embeddings are computed independently. At scoring time, a MaxSim operator finds, for each query token, its maximum similarity over all document tokens, and the relevance score is the sum of those maxima.
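
The MaxSim scoring step can be sketched in a few lines of plain Python. This is a minimal illustration that assumes dot-product similarity and uses made-up two-dimensional token embeddings. The helper names are hypothetical, and real embeddings would come from the model's query and document encoders.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best match among the document tokens,
    then sum those maxima to get the relevance score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, one row per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.2], [0.3, 0.1]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# Rank documents by descending late-interaction score.
ranking = sorted({"doc_a": score_a, "doc_b": score_b}.items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

Because the document token embeddings do not depend on the query, they can be computed and indexed ahead of time. Only the comparatively cheap MaxSim step runs at query time.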

In practice

Use for multimodal RAG systems requiring high accuracy.
Apply in multimedia search engines.
Integrate into conversational AI for rich input understanding.

Topics

Multimodal Retrieval
Late Interaction Models
Nemotron ColEmbed V2
ViDoRe Benchmark
RAG Systems

Best for: AI Architect, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.