Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Conan-embedding-v3 is an omni-modal embedding model for unified retrieval across text, image, video, document, and audio inputs. It uses a "decouple-fuse-recover" framework to address the difficulty of integrating diverse modalities. The model first trains modality specialists independently, then fuses their task vectors into a single dense backbone, a process termed Decoupled Specialist Fusion. This fusion combines visual, video, and document retrieval capabilities in one backbone. However, it introduces "Projector Drift" for projector-based modalities such as audio: the projector remains calibrated to the specialist backbone rather than the fused one, which causes significant retrieval regression. To mitigate this, Conan-embedding-v3 applies Projector Recovery, which consists of full-parameter fine-tuning of the projector with the backbone frozen, followed by balanced multi-modal rehearsal. The resulting model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.

Key takeaway

For AI scientists developing omni-modal embedding models, the "decouple-fuse-recover" framework is the central idea to understand. If you integrate projector-based modalities such as audio, expect "Projector Drift" after fusing backbones. Plan for a Projector Recovery phase (full-parameter fine-tuning of the projector with the backbone frozen, followed by balanced multi-modal rehearsal) to prevent significant retrieval regression and keep performance robust across all modalities.

Key insights

Conan-embedding-v3 fuses modality specialists for omni-modal embeddings, but requires "Projector Recovery" to prevent "Projector Drift" in projector-based modalities.

Principles

Decoupled Specialist Fusion composes diverse retrieval capabilities.
Backbone fusion can cause "Projector Drift" for external projectors.
Projector Recovery requires fine-tuning the projector with a frozen backbone.
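
The first principle can be pictured as task-vector arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes each task vector is the parameter difference between a specialist and a shared base model, and that fusion is a weighted sum of those differences added back to the base. The parameter name, the values, and the weights are hypothetical, and the summary does not state which merging rule the model actually uses.

```python
def task_vector(base, specialist):
    """Per-parameter difference between a specialist and the shared base."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base, specialists, weights):
    """Add weighted task vectors to a copy of the base parameters.

    Assumption: plain weighted-sum merging; the source may use another rule.
    """
    fused = {name: list(values) for name, values in base.items()}
    for specialist, weight in zip(specialists, weights):
        for name, delta in task_vector(base, specialist).items():
            fused[name] = [f + weight * d for f, d in zip(fused[name], delta)]
    return fused


# Hypothetical toy parameters standing in for real backbone weights.
base = {"layer.weight": [1.0, 2.0, 3.0]}
image_specialist = {"layer.weight": [1.5, 2.0, 3.0]}
video_specialist = {"layer.weight": [1.0, 2.0, 2.0]}

fused = fuse(base, [image_specialist, video_specialist], weights=[1.0, 0.5])
print(fused)
```

Because each specialist is trained independently, the fused backbone differs from every individual specialist backbone, which is the condition under which a projector tuned against one specialist can become miscalibrated.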

Method

Conan-embedding-v3 uses a decouple-fuse-recover framework: train modality specialists independently, fuse their task vectors into a single dense backbone, then apply Projector Recovery, which fine-tunes the projector with the backbone frozen and follows with balanced multi-modal rehearsal.
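
The drift-and-recover part of the method can be illustrated with a deliberately tiny toy. This sketch rests on several assumptions that are not from the source: the projector and backbone are each reduced to a single scalar weight, the objective is mean squared error rather than a retrieval loss, and all values are hypothetical. It only shows the mechanism: a projector calibrated to a specialist backbone produces errors once the backbone is swapped for a fused one, and updating only the projector with the backbone held fixed removes that error. The rehearsal stage is not modeled.

```python
def embed(x, projector_w, backbone_w):
    """Toy pipeline: the projector maps the input, then the backbone maps it."""
    return backbone_w * (projector_w * x)


def mse(projector_w, backbone_w, data):
    """Mean squared error between toy embeddings and their targets."""
    return sum((embed(x, projector_w, backbone_w) - y) ** 2 for x, y in data) / len(data)


def recover_projector(projector_w, backbone_w, data, lr=0.005, steps=200):
    """Gradient descent on the projector only; backbone_w is never updated."""
    for _ in range(steps):
        grad = sum(
            2 * (embed(x, projector_w, backbone_w) - y) * backbone_w * x
            for x, y in data
        ) / len(data)
        projector_w -= lr * grad
    return projector_w


# Hypothetical data: the target embedding equals the input.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
specialist_backbone = 2.0   # backbone the projector was trained against
fused_backbone = 4.0        # backbone after fusion (assumed value)
projector = 0.5             # calibrated so that 2.0 * 0.5 == 1.0

loss_before_fusion = mse(projector, specialist_backbone, data)
loss_after_fusion = mse(projector, fused_backbone, data)      # drift appears
recovered = recover_projector(projector, fused_backbone, data)
loss_after_recovery = mse(recovered, fused_backbone, data)

print(loss_before_fusion, loss_after_fusion, loss_after_recovery)
```

In this toy the recovered projector weight approaches 0.25, the value that matches the fused backbone, while the backbone weight itself is untouched.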

Topics

Omni-modal Embedding
Conan-embedding-v3
Multimodal Retrieval
Modality Fusion
Projector Drift
Audio Retrieval

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.