BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations
Summary
BEST-RQ-2 is a self-supervised learning model for audio representations that extends BEST-RQ with a two-step "contextualize-then-predict" pretraining scheme. A Vision Transformer (ViT) context encoder processes the unmasked spectrogram regions, and a lightweight predictor then infers targets for the masked regions; the predictor is discarded after pretraining. BEST-RQ-2 also replaces its predecessor's Conformer encoder with a ViT, which shifts performance across domains: slightly lower scores on speech, higher scores on music and environmental sound, and a comparable average. The core innovation is this decomposed masked prediction. BEST-RQ-2 consistently outperforms one-stage baselines in overall transfer on the X-ARES and XARES-LLM benchmarks while keeping inference compute unchanged. Code and model checkpoints are publicly available.
Key takeaway
Machine Learning Engineers developing self-supervised audio models should consider adopting BEST-RQ-2's "contextualize-then-predict" two-step pretraining. The approach improves overall transfer performance on benchmarks such as X-ARES and XARES-LLM without increasing inference compute. If your application prioritizes music or environmental sounds over speech, the switch to a ViT encoder could be particularly beneficial. The publicly available code and model checkpoints are a starting point for integrating the method.
Key insights
BEST-RQ-2's two-step contextualize-then-predict scheme improves self-supervised audio representation transfer by decoupling context encoding from masked prediction.
Principles
- Decomposing masked prediction enhances transfer.
- ViT encoders can rebalance audio domain performance.
- Frozen random-projection targets remain effective.
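The third principle can be illustrated with a small sketch of how frozen random-projection targets are typically produced. The summary does not give BEST-RQ-2's exact recipe, so the details here are assumptions: a randomly initialized projection matrix and codebook that are never trained, L2 normalization, and nearest-neighbor lookup. The names make_frozen_quantizer and quantize, and all dimensions, are hypothetical.

```python
import random


def _l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)


def make_frozen_quantizer(input_dim, code_dim, codebook_size, seed=0):
    """Build a random projection and codebook that are never updated (assumed setup)."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                  for _ in range(code_dim)]
    codebook = [_l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(codebook_size)]
    return projection, codebook


def quantize(patch, projection, codebook):
    """Return the index of the codebook entry nearest to the projected patch."""
    projected = _l2_normalize(
        [sum(w * x for w, x in zip(row, patch)) for row in projection]
    )

    def sq_dist(code):
        return sum((p - c) ** 2 for p, c in zip(projected, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
```

Because nothing in the quantizer is learned, the same patch always maps to the same discrete label, which gives the model a fixed classification target for each masked region.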
Method
BEST-RQ-2 pretrains in two steps: a ViT context encoder first processes only the unmasked spectrogram regions, and a lightweight predictor then infers the targets for the masked regions from that context. The predictor is discarded after pretraining, so only the context encoder is used at inference.
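The data flow of the two steps can be sketched as follows. This is a toy illustration, not the released implementation: context_encoder and predictor are arithmetic placeholders standing in for the ViT and the lightweight predictor, and the function names, mask ratio, and number of codes are all assumptions. Only the structure comes from the description above: the encoder sees visible patches, the predictor scores masked positions, and the predictor is absent at inference.

```python
import math
import random


def split_patches(num_patches, mask_ratio, seed=0):
    """Randomly partition patch indices into (visible, masked) lists."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    # Keep at least one patch on each side of the split.
    num_masked = max(1, min(num_patches - 1, round(num_patches * mask_ratio)))
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def context_encoder(patches, positions):
    """Placeholder for the ViT: embeds ONLY the patches at `positions`."""
    dim = len(patches[0])
    mean = [sum(patches[i][d] for i in positions) / len(positions)
            for d in range(dim)]
    return {i: [0.5 * patches[i][d] + 0.5 * mean[d] for d in range(dim)]
            for i in positions}


def predictor(context, masked_positions, num_codes):
    """Placeholder predictor: one logit vector over codes per masked position."""
    pooled = sum(sum(vec) for vec in context.values()) / len(context)
    return {pos: [pooled * math.cos(pos + k) for k in range(num_codes)]
            for pos in masked_positions}


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def pretrain_loss(patches, targets, mask_ratio=0.5, num_codes=8, seed=0):
    visible, masked = split_patches(len(patches), mask_ratio, seed)
    context = context_encoder(patches, visible)     # step 1: contextualize
    logits = predictor(context, masked, num_codes)  # step 2: predict
    return sum(cross_entropy(logits[p], targets[p]) for p in masked) / len(masked)


def extract_features(patches):
    """Inference: no predictor; the encoder alone sees every patch."""
    return context_encoder(patches, list(range(len(patches))))
```

In this sketch the loss is computed only at masked positions, and extract_features never calls predictor, which mirrors the claim that the extra pretraining component adds no inference cost.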
In practice
- Implement two-step pretraining for audio tasks.
- Evaluate ViT encoders for non-speech audio.
- Utilize public BEST-RQ-2 code and checkpoints.
Topics
- Self-Supervised Learning
- Audio Representations
- Vision Transformers
- Masked Prediction
- Pretraining Architectures
- X-ARES Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.