Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding
Summary
Conan-embedding-v3 is an omni-modal embedding model designed for unified retrieval across text, image, video, document, and audio inputs. It uses a "decouple-fuse-recover" framework to address the challenges of integrating diverse modalities. The model first trains modality specialists independently, then fuses their task vectors into a single dense backbone, a process termed Decoupled Specialist Fusion. This fusion combines visual, video, and document retrieval capabilities in one backbone. However, it introduces "Projector Drift" for projector-based modalities such as audio: the projector remains calibrated to the specialist backbone rather than the fused one, which causes significant retrieval regression. To mitigate this, Conan-embedding-v3 applies Projector Recovery, which consists of full-parameter fine-tuning of the projector with the backbone frozen, followed by balanced multi-modal rehearsal. The resulting model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
Key takeaway
For AI scientists developing omni-modal embedding models, the "decouple-fuse-recover" framework is the central idea. If you integrate a projector-based modality such as audio, expect "Projector Drift" after fusing backbones, because the projector is still calibrated to the specialist backbone. Add a Projector Recovery phase: full-parameter fine-tuning of the projector with the backbone frozen, followed by balanced multi-modal rehearsal. This prevents significant retrieval regression and keeps performance robust across all modalities.
Key insights
Conan-embedding-v3 fuses modality specialists for omni-modal embeddings, but requires "Projector Recovery" to prevent "Projector Drift" in projector-based modalities.
Principles
- Decoupled Specialist Fusion composes diverse retrieval capabilities.
- Backbone fusion can cause "Projector Drift" for external projectors.
- Projector Recovery requires fine-tuning the projector with the backbone frozen.
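To make the last two principles concrete, here is a deliberately tiny sketch of drift and recovery. It is not the authors' implementation: the backbone and projector are reduced to single scalar weights, and the names `embed`, `mse`, and `recover_projector`, the data, and the learning rate are all hypothetical. It only illustrates the mechanism of updating the projector while the fused backbone stays fixed; the balanced multi-modal rehearsal stage is not shown.

```python
from typing import List


def embed(backbone_w: float, projector_w: float, x: float) -> float:
    """Toy pipeline: audio feature -> projector -> backbone -> embedding."""
    return backbone_w * (projector_w * x)


def mse(backbone_w: float, projector_w: float,
        xs: List[float], ys: List[float]) -> float:
    """Mean squared error between produced and target embeddings."""
    return sum((embed(backbone_w, projector_w, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def recover_projector(backbone_w: float, projector_w: float,
                      xs: List[float], ys: List[float],
                      lr: float = 0.02, steps: int = 200) -> float:
    """Gradient descent on the projector only; backbone_w is never updated."""
    for _ in range(steps):
        grad = sum(2.0 * (embed(backbone_w, projector_w, x) - y) * backbone_w * x
                   for x, y in zip(xs, ys)) / len(xs)
        projector_w -= lr * grad
    return projector_w


# Hypothetical numbers: the projector was calibrated against a specialist
# backbone (weight 2.0), but fusion moved the backbone weight to 1.6.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]          # target embeddings
specialist_backbone = 2.0
fused_backbone = 1.6
projector = 0.5               # calibrated so that 2.0 * 0.5 * x == x

loss_before_fusion = mse(specialist_backbone, projector, xs, ys)
loss_after_fusion = mse(fused_backbone, projector, xs, ys)   # drift
recovered = recover_projector(fused_backbone, projector, xs, ys)
loss_recovered = mse(fused_backbone, recovered, xs, ys)
```

In this toy setting the loss is zero before fusion, becomes positive once the backbone changes under the stale projector, and returns to near zero after the projector alone is retrained against the fused backbone.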
Method
Conan-embedding-v3 uses a decouple-fuse-recover framework: train modality specialists, fuse task vectors into a dense backbone, then apply Projector Recovery via projector fine-tuning (frozen backbone) and balanced multi-modal rehearsal.
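The fuse step can be sketched with plain task arithmetic. This summary does not state the exact merge rule the authors use, so the weighted sum below is an assumption, and the helper names (`task_vector`, `fuse`), the parameter name `layer.weight`, and the coefficients are hypothetical. A task vector is taken to be the element-wise difference between a specialist's weights and the shared base weights.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(base: Weights, specialist: Weights) -> Weights:
    """Per-parameter difference: specialist weights minus base weights."""
    return {name: [s - b for s, b in zip(specialist[name], base[name])]
            for name in base}


def fuse(base: Weights, specialists: List[Weights],
         coeffs: List[float]) -> Weights:
    """Add each specialist's scaled task vector onto a copy of the base."""
    fused = {name: list(values) for name, values in base.items()}
    for spec, lam in zip(specialists, coeffs):
        tau = task_vector(base, spec)
        for name in fused:
            fused[name] = [f + lam * t for f, t in zip(fused[name], tau[name])]
    return fused


# Hypothetical three-element "backbone" shared by two specialists.
base = {"layer.weight": [1.0, 2.0, 3.0]}
image_specialist = {"layer.weight": [1.5, 2.0, 3.0]}
video_specialist = {"layer.weight": [1.0, 2.0, 2.0]}

fused = fuse(base, [image_specialist, video_specialist], [1.0, 1.0])
# fused["layer.weight"] is [1.5, 2.0, 2.0]
```

Because each specialist here changed a different coordinate, the fused weights carry both changes without interference. Real specialists overlap, which is why the scaling coefficients matter in practice.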
Topics
- Omni-modal Embedding
- Conan-embedding-v3
- Multimodal Retrieval
- Modality Fusion
- Projector Drift
- Audio Retrieval
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.