BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations

2026-06-29 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

BEST-RQ-2 is a self-supervised learning model for audio representations. It extends BEST-RQ with a two-step "contextualize-then-predict" pretraining scheme: a Vision Transformer (ViT) context encoder processes the unmasked regions of the spectrogram, and a lightweight predictor then infers targets for the masked regions. The predictor is used only during pretraining and is discarded afterward. BEST-RQ-2 also replaces its predecessor's Conformer encoder with a ViT, which shifts performance across domains: slightly lower on speech, higher on music and environmental sound, with comparable average scores. The core innovation is this decomposition of masked prediction into two steps. BEST-RQ-2 consistently outperforms one-stage baselines in overall transfer on the X-ARES and XARES-LLM benchmarks while keeping inference compute unchanged. Code and model checkpoints are publicly available.

Key takeaway

Machine learning engineers developing self-supervised audio models should consider BEST-RQ-2's two-step "contextualize-then-predict" pretraining. It delivers better overall transfer than one-stage baselines on the X-ARES and XARES-LLM benchmarks without increasing inference compute. If your application prioritizes music or environmental sound over speech, the switch to a ViT encoder could be particularly beneficial. The publicly available code and model checkpoints are a starting point for integrating the method.

Key insights

BEST-RQ-2's two-step contextualize-then-predict scheme improves self-supervised audio representation transfer by decoupling context encoding from masked prediction.

Principles

Decomposing masked prediction enhances transfer.
ViT encoders can rebalance audio domain performance.
Frozen random-projection targets remain effective.
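
The third principle refers to the target scheme inherited from BEST-RQ, in which a randomly initialized, never-trained projection and codebook turn input features into discrete labels. The sketch below is a minimal illustration of that idea under assumed settings, not the released implementation: the function names, dimensions, and codebook size are hypothetical.

```python
import numpy as np

def make_random_quantizer(feat_dim, proj_dim=16, codebook_size=8192, seed=0):
    """Create a frozen projection matrix and codebook (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((feat_dim, proj_dim))
    codebook = rng.standard_normal((codebook_size, proj_dim))
    # Unit-normalize the code vectors so nearest-neighbor search reduces to cosine similarity.
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook

def quantize(features, projection, codebook):
    """Map each feature row to the index of its nearest code vector."""
    projected = features @ projection
    projected = projected / (np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8)
    similarity = projected @ codebook.T  # (num_patches, codebook_size)
    return similarity.argmax(axis=1)

# Usage: 10 hypothetical spectrogram patches with 64 values each.
projection, codebook = make_random_quantizer(feat_dim=64, codebook_size=32)
patches = np.random.default_rng(1).standard_normal((10, 64))
targets = quantize(patches, projection, codebook)  # integer labels in [0, 32)
```

Because neither the projection nor the codebook receives gradients, the labels for a given input stay fixed throughout pretraining.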

Method

BEST-RQ-2 pretrains in two steps: a ViT context encoder first processes the unmasked regions of the spectrogram, and a lightweight predictor then infers targets for the masked regions. The predictor is discarded after pretraining, so only the encoder is used downstream.
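
The data flow of the two steps can be sketched at toy scale as follows. This is an illustrative approximation only: single matrices stand in for the ViT blocks and the predictor, the way the predictor consumes context (a mean-pooled context vector added to every position) is an assumption made for brevity, the targets are random stand-ins for quantizer labels, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Randomly initialized weight matrix (toy stand-in for a trained layer)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

def pretraining_step(patches, targets, mask, params):
    """patches: (N, patch_dim); targets: (N,) ints; mask: (N,) bool, True = masked."""
    n = patches.shape[0]
    # Step 1, contextualize: the encoder only ever sees the visible patches.
    visible = patches[~mask] @ params["embed"] + params["pos"][~mask]
    context = np.tanh(visible @ params["encoder"])  # stand-in for ViT blocks
    # Step 2, predict: rebuild the full sequence with mask tokens at masked slots.
    sequence = np.empty((n, context.shape[1]))
    sequence[~mask] = context
    sequence[mask] = params["mask_token"] + params["pos"][mask]
    # Assumed mixing: add pooled context so masked slots can draw on visible ones.
    mixed = sequence + context.mean(axis=0)
    logits = np.tanh(mixed @ params["predictor"]) @ params["head"]
    # Cross-entropy over codebook indices, computed at masked positions only.
    masked_logits = logits[mask]
    shifted = masked_logits - masked_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(masked_logits.shape[0]), targets[mask]].mean()
    return loss, context

num_patches, patch_dim, model_dim, codebook_size = 12, 64, 32, 16
params = {
    "embed": linear(patch_dim, model_dim),
    "pos": rng.standard_normal((num_patches, model_dim)) * 0.02,
    "encoder": linear(model_dim, model_dim),
    "mask_token": np.zeros(model_dim),
    "predictor": linear(model_dim, model_dim),
    "head": linear(model_dim, codebook_size),
}
patches = rng.standard_normal((num_patches, patch_dim))
targets = rng.integers(0, codebook_size, size=num_patches)  # stand-in labels
mask = np.zeros(num_patches, dtype=bool)
mask[[2, 3, 7, 8, 9]] = True

loss, context = pretraining_step(patches, targets, mask, params)
```

After pretraining, only `embed`, `pos`, and `encoder` would be kept in this sketch; `mask_token`, `predictor`, and `head` are dropped, which is why the second step adds no inference cost.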

In practice

Implement two-step pretraining for audio tasks.
Evaluate ViT encoders for non-speech audio.
Utilize public BEST-RQ-2 code and checkpoints.

Topics

Self-Supervised Learning
Audio Representations
Vision Transformers
Masked Prediction
Pretraining Architectures
X-ARES Benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.