S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

2026-04-29 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models) is a framework for distilling general audio foundation models using only their output embeddings. It addresses the cost of large, computationally expensive audio models by training smaller, more efficient student models without class logits, intermediate features, or architecture-specific techniques, which makes it architecture-agnostic and broadly applicable to embedding-based teachers. The framework was evaluated by distilling two audio foundation models, M2D and MATPAC++, into three student architectures: MobileNetV3 (2.9M parameters), DynamicMobileNet (8.7M parameters), and ERes2Net (1.4M parameters). The distilled students are up to 61x smaller than their teachers. The best result, MATPAC++ distilled into MobileNetV3, retains 96.4% of the teacher's performance with ~30x fewer parameters. The study also reports insights on the choice of loss function, favoring cosine similarity, and on clustering-based balanced data sampling.

Key takeaway

For AI Engineers and Research Scientists developing efficient audio models, S-SONDO offers a way to substantially reduce model size and inference cost at a small performance cost: the best student retains 96.4% of its teacher's performance. Consider this self-supervised knowledge distillation framework when compressing large foundation models, especially when the teacher exposes only embeddings. Prioritize cosine similarity as the distillation loss, and experiment with cluster-based balanced data sampling to improve student training and convergence, particularly for smaller architectures.

Key insights

S-SONDO enables efficient self-supervised knowledge distillation for audio models using only output embeddings, making it architecture-agnostic.

Principles

Embedding alignment provides rich training signals.
Balanced data sampling benefits smaller student models.
Cosine loss is effective for embedding-only distillation.
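
As a rough illustration of the balanced-sampling principle, the sketch below draws training examples so that each cluster is equally likely to be picked, however many examples it contains. It assumes cluster assignments are already available (for instance from clustering embeddings beforehand); the function names, the 90/9/1 toy split, and the inverse-cluster-size weighting are illustrative assumptions, not details confirmed by the paper.

```python
import random
from collections import Counter


def balanced_sample_weights(cluster_ids):
    """Per-example weights giving every cluster the same total probability."""
    sizes = Counter(cluster_ids)
    n_clusters = len(sizes)
    # Each cluster gets mass 1 / n_clusters, split evenly among its members.
    return [1.0 / (n_clusters * sizes[c]) for c in cluster_ids]


def sample_batch(cluster_ids, batch_size, rng):
    """Draw example indices (with replacement) using the balanced weights."""
    weights = balanced_sample_weights(cluster_ids)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=batch_size)


# Hypothetical, heavily imbalanced assignment: 90 / 9 / 1 examples per cluster.
cluster_ids = [0] * 90 + [1] * 9 + [2] * 1
weights = balanced_sample_weights(cluster_ids)

rng = random.Random(0)
batch = sample_batch(cluster_ids, 3000, rng)
drawn = Counter(cluster_ids[i] for i in batch)  # roughly equal counts per cluster
```

Sampling with replacement means examples from rare clusters are repeated often, so whether this helps in a given setup is worth checking empirically.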

Method

S-SONDO projects student embeddings into the teacher's latent space via a mapping head, then aligns the projected embeddings with the teacher's embeddings using a distillation loss such as cosine similarity. Training is further supported by cluster-based balanced data sampling.
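
The method can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping head is assumed to be a single linear layer, the loss is taken as the batch mean of one minus cosine similarity, and all names and dimensions are hypothetical. A real training setup would use a tensor library with automatic differentiation.

```python
import math
import random


def linear_head(x, weight, bias):
    """Project a student embedding (length d_s) into teacher space (length d_t)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def cosine_similarity(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def cosine_distillation_loss(student_batch, teacher_batch, weight, bias):
    """Mean of (1 - cosine similarity) between projected student and teacher."""
    losses = [
        1.0 - cosine_similarity(linear_head(s, weight, bias), t)
        for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(losses) / len(losses)


# Toy dimensions (assumed): 4-d student embeddings, 6-d teacher embeddings.
rng = random.Random(0)
d_student, d_teacher, batch_size = 4, 6, 3
weight = [[rng.gauss(0, 1) for _ in range(d_student)] for _ in range(d_teacher)]
bias = [0.0] * d_teacher
student = [[rng.gauss(0, 1) for _ in range(d_student)] for _ in range(batch_size)]
teacher = [[rng.gauss(0, 1) for _ in range(d_teacher)] for _ in range(batch_size)]

loss = cosine_distillation_loss(student, teacher, weight, bias)

# If the teacher embeddings equal the projected student embeddings, the loss vanishes.
aligned_teacher = [linear_head(s, weight, bias) for s in student]
aligned_loss = cosine_distillation_loss(student, aligned_teacher, weight, bias)
```

Because the loss depends only on the two embedding vectors, nothing in the sketch refers to the student's or teacher's internal architecture, which is the property the summary describes as architecture-agnostic.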

In practice

Use cosine similarity as the primary distillation loss.
Implement cluster-based balanced data sampling for smaller students.
Consider the impact of batch size on contrastive losses such as the CLAP-style loss.
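
To see why batch size matters for a contrastive loss, the sketch below implements a symmetric in-batch contrastive loss of the kind used in CLAP-style training, where every other item in the batch acts as a negative. It assumes student embeddings have already been projected to the teacher's dimension. The temperature of 0.07 and all function names are illustrative assumptions, and this is not claimed to be the exact loss variant evaluated in the paper.

```python
import math


def l2_normalize(v, eps=1e-8):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / max(norm, eps) for a in v]


def diag_log_softmax(logits):
    """Log-softmax value of the matching (diagonal) entry in each row."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[i] - log_z)
    return out


def symmetric_contrastive_loss(student, teacher, temperature=0.07):
    """Average of student-to-teacher and teacher-to-student cross-entropy."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    n = len(s)
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the reverse direction
    s2t = -sum(diag_log_softmax(logits)) / n
    t2s = -sum(diag_log_softmax(logits_t)) / n
    return 0.5 * (s2t + t2s)


def uninformative_loss(batch_size):
    """Loss when all embeddings are identical, so positives look like negatives."""
    same = [[1.0, 0.0]] * batch_size
    return symmetric_contrastive_loss(same, same)


loss_b8 = uninformative_loss(8)    # equals log(8)
loss_b64 = uninformative_loss(64)  # equals log(64)

# Perfectly separated pairs: each student matches only its own teacher embedding.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_loss = symmetric_contrastive_loss(identity, identity)  # close to zero
```

With B items per batch, each example is contrasted against B - 1 negatives and the loss of an uninformative model is log(B). Both the difficulty of the task and the scale of the loss therefore change with batch size, unlike a per-pair cosine loss.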

Topics

Self-Supervised Knowledge Distillation
General Audio Models
Model Compression
Embedding Alignment
Balanced Data Sampling

Code references

MedAliAdlouni/ssondo

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.