Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
Summary
Echo is a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a Joint-Embedding Predictive Architecture (JEPA) objective and then specialized in stages to handle speaker identity, phonetic content, and dynamic source routing within the same 512-dimensional latent space, without per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with an unknown number of speakers K, Echo achieves 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap. The primary contribution is that these three tasks coexist on one encoder at this footprint. The work also identifies a structural wall: end-to-end ASR through the VQ bottleneck did not succeed.
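For readers unfamiliar with the SI-SDR figure quoted above, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio on plain Python lists. It is not Echo's evaluation code: the summary does not say how the metric is applied to latents, so the function name `si_sdr`, the `eps` stabiliser, and the flat-sequence inputs are illustrative assumptions.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sequences.

    Hypothetical helper for illustration; not taken from the Echo article.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual part.
    projection = [alpha * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]

    signal_power = sum(p * p for p in projection)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))


if __name__ == "__main__":
    reference = [1.0, -1.0, 2.0, -2.0]
    rescaled = [2.0 * x for x in reference]   # same shape, different gain
    distorted = [1.5, -0.5, 2.5, -2.5]        # offset plus shape error
    print(si_sdr(rescaled, reference))
    print(si_sdr(distorted, reference))
```

A gain figure such as the one in the summary is commonly obtained by subtracting the SI-SDR of the unprocessed mixture from the SI-SDR of the separated output, both measured against the same reference. Whether Echo reports it that way is not stated here.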
Key takeaway
For Machine Learning Engineers developing multi-task audio systems, Echo suggests a viable path to consolidating speaker diarization, phonetic-content modeling, and source separation onto a single, compact encoder. Consider this joint-embedding architecture when you want to reduce model footprint and avoid per-task fine-tuning at deployment, but note that end-to-end ASR through the VQ bottleneck hit a structural wall and remains unresolved.
Key insights
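A single 25M-parameter ViT encoder can jointly support speaker diarization, phonetic-content modeling, and source separation in a shared 512-dimensional latent space. End-to-end speech recognition through the VQ bottleneck is the part that did not work.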
Principles
- Achieve multi-task audio processing with a single encoder.
- Eliminate per-task fine-tuning at deployment.
Method
Pretrain a ViT encoder with JEPA, then specialize it for speaker identity, phonetic content, and dynamic source routing in a 512-dimensional latent space, using light heads for specific tasks.
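The pretraining idea can be sketched as follows. In JEPA-style training, a predictor is trained to match the latents that a separate target encoder produces for masked regions, and the target encoder is commonly kept as an exponential moving average (EMA) of the online encoder. The summary does not describe Echo's masking scheme, loss, or update rule, so the squared-error loss, the `tau` value, and the names `ema_update` and `latent_prediction_loss` are assumptions for illustration only.

```python
def ema_update(target_params, online_params, tau=0.99):
    """Return target weights moved a small step toward the online weights.

    Assumed EMA rule: target <- tau * target + (1 - tau) * online.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


def latent_prediction_loss(predicted, target, masked_idx):
    """Mean squared error between latents, computed at masked positions only.

    predicted, target: lists of latent vectors, one per time/patch position.
    masked_idx: positions hidden from the context encoder.
    """
    if not masked_idx:
        raise ValueError("at least one masked position is required")
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy 2-dimensional latents at three positions; position 1 is masked.
    target_latents = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
    predicted_latents = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
    print(latent_prediction_loss(predicted_latents, target_latents, [1]))
    print(ema_update([1.0, 2.0], [0.0, 0.0], tau=0.9))
```

The point of predicting in latent space is that the loss never touches raw audio, so the encoder is not pushed to reconstruct every low-level detail of the signal.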
In practice
- Run speaker diarization and phonetic-content modeling on one shared encoder.
- Perform dynamic source separation with minimal overhead.
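The PIT figure in the summary refers to permutation-invariant training, where separated outputs are scored against reference sources under the best possible output-to-source assignment. Below is a minimal brute-force sketch of that assignment step. It assumes equal numbers of estimates and targets and a squared-distance cost, and it does not model the null-target K-set prediction mentioned in the summary; the names `pit_assign` and `squared_distance` are hypothetical.

```python
from itertools import permutations


def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pit_assign(estimates, targets):
    """Find the output-to-target assignment with the lowest total cost.

    Returns (best_perm, best_cost), where best_perm[i] is the index of the
    target matched to estimates[i]. Brute force over all permutations, so
    this is only practical for a small number of sources.
    """
    if len(estimates) != len(targets):
        raise ValueError("this sketch assumes equal numbers of sources")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(targets))):
        cost = sum(squared_distance(estimates[i], targets[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


if __name__ == "__main__":
    targets = [[0.0, 1.0], [1.0, 0.0]]
    estimates = [[0.9, 0.1], [0.1, 0.9]]  # outputs arrive in swapped order
    print(pit_assign(estimates, targets))
```

For larger source counts, the same assignment is usually solved with the Hungarian algorithm instead of enumerating permutations.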
Topics
- Speaker Diarization
- Speech Recognition
- Joint-Embedding Predictive Architecture
- ViT Encoder
- Latent Space
- Source Separation
- VoxCeleb2
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.