Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding
Summary
Conan-embedding-v3 is a decouple-fuse-recover framework for omni-modal retrieval that aims to build a single embedding space for text, image, video, document, and audio inputs. It first applies "Decoupled Specialist Fusion": modality specialists are trained independently, and their task vectors are then fused into a unified dense backbone. This strategy effectively combines visual, video, and document retrieval capabilities, but it introduces a "Projector Drift" failure for projector-based modalities such as audio, causing significant performance regression even though the audio-specific modules are preserved. To address this, Conan-embedding-v3 applies "Projector Recovery": full-parameter fine-tuning of the projector with the backbone kept frozen, followed by balanced multi-modal rehearsal. The resulting model integrates these retrieval pathways into one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
Key takeaway
Machine learning engineers building omni-modal retrieval systems should anticipate "Projector Drift" when fusing modality-specific backbones, particularly for projector-based inputs such as audio. The regression can be significant even when the modality-specific modules themselves are left unchanged. To mitigate it, consider adding a "Projector Recovery" phase: fine-tune only the projector while keeping the fused backbone frozen, then run balanced multi-modal rehearsal so that performance holds across all integrated modalities.
Key insights
Fusing modality-specific models for omni-modal embeddings requires addressing "Projector Drift" through targeted recovery.
Principles
- Decoupled Specialist Fusion composes modality capabilities.
- Projector Drift causes regression in fused projector-based systems.
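The first principle can be illustrated with a minimal sketch of task-vector fusion. It assumes the simplest form of task arithmetic (each specialist's difference from a shared base is summed onto the base with one uniform scale); the summary does not state the actual fusion rule, and the parameter name `layer.weight`, the helper names, and the toy values are all hypothetical.

```python
# Hypothetical sketch of task-vector fusion. The uniform-sum rule, the
# parameter names, and the toy weights are assumptions, not the paper's recipe.

def task_vector(base, specialist):
    """Per-parameter difference between a specialist and the shared base."""
    return {name: [s - b for s, b in zip(specialist[name], base[name])]
            for name in base}

def fuse(base, specialists, scale=1.0):
    """Add the scaled sum of all task vectors onto a copy of the base weights."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist in specialists:
        delta = task_vector(base, specialist)
        for name in fused:
            fused[name] = [w + scale * d
                           for w, d in zip(fused[name], delta[name])]
    return fused

# Toy weights: each specialist moved a different coordinate of the base.
base = {"layer.weight": [1.0, 2.0, 3.0]}
image_specialist = {"layer.weight": [1.5, 2.0, 3.0]}
video_specialist = {"layer.weight": [1.0, 2.0, 2.0]}

fused = fuse(base, [image_specialist, video_specialist])
print(fused)  # {'layer.weight': [1.5, 2.0, 2.0]}
```

In this toy case the specialists changed disjoint coordinates, so the fused weights retain both changes. Real specialists overlap, which is where interference between task vectors can appear.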
Method
Conan-embedding-v3 uses a decouple-fuse-recover framework: train modality specialists independently, fuse their task vectors into a single backbone, then apply full-parameter fine-tuning of the projectors with the backbone frozen, followed by balanced multi-modal rehearsal.
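The recover step can be sketched with a toy scalar model. It assumes one plausible reading of Projector Drift: the projector was tuned against the pre-fusion backbone, so it no longer matches once fusion shifts the backbone weights. The scalar "backbone" and "projector", the learning rate, and the data are all hypothetical, and the rehearsal stage is not modeled.

```python
# Toy scalar model: embedding = backbone * (projector * x).
# All names and values are hypothetical illustrations, not the paper's setup.

def embed(x, projector, backbone):
    return backbone * (projector * x)

def squared_error(projector, backbone, data):
    return sum((embed(x, projector, backbone) - t) ** 2 for x, t in data)

def recover_projector(projector, backbone, data, lr=0.01, steps=200):
    """Gradient descent on mean squared error, updating only the projector."""
    for _ in range(steps):
        grad = 0.0
        for x, target in data:
            error = embed(x, projector, backbone) - target
            grad += 2.0 * error * backbone * x  # d(error^2) / d(projector)
        projector -= lr * grad / len(data)
        # The backbone is never updated here: it stays frozen.
    return projector

old_backbone, fused_backbone = 2.0, 3.0  # fusion shifted the backbone
projector = 0.5                           # tuned against old_backbone

# Targets are the embeddings the audio pathway produced before fusion.
data = [(x, embed(x, projector, old_backbone)) for x in (1.0, 2.0, 3.0)]

drift_error = squared_error(projector, fused_backbone, data)
recovered = recover_projector(projector, fused_backbone, data)
recovered_error = squared_error(recovered, fused_backbone, data)
print(drift_error, recovered_error)
```

The projector itself is untouched by fusion, yet its outputs are wrong once the backbone changes. Retraining only the projector restores the original embeddings without disturbing the fused backbone that the other modalities depend on.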
Topics
- Omni-modal Embedding
- Multi-modal Retrieval
- Modality Fusion
- Projector Drift
- Conan-embedding-v3
- Deep Learning Architectures
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.