Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

The paper introduces a data-free, training-free compression approach for speech foundation models based on channel-wise k-means clustering of parameters. It also explores mixed-sparsity pruning, in which the number of parameter clusters varies from layer to layer. On LibriSpeech, applying the method to HuBERT-large at 50% sparsity reduces absolute Word Error Rate (WER) by 27.73% on test-clean and 18.61% on test-other relative to magnitude-based pruning, before fine-tuning. After 3 epochs of fine-tuning, the reductions are 0.19% and 0.79%, respectively. For Whisper-large-v3 at 10% sparsity, absolute WER falls by 2.86% and 5.02% against magnitude-based pruning, with no significant WER increase relative to the uncompressed baseline. The compressed models are coarse-grained and hardware-friendly.

Key takeaway

Machine learning engineers optimizing speech foundation models for resource-constrained environments should consider parameter clustering. This data-free, training-free approach reduces model size and computational demands while yielding lower Word Error Rate than magnitude-based pruning at the same sparsity. In the reported experiments, Whisper-large-v3 at 10% sparsity showed no significant WER increase over the uncompressed baseline. Because the compression is structured, models like HuBERT-large and Whisper-large-v3 can be deployed on standard hardware without specialized libraries.

Key insights

Parameter clustering offers data-free, training-free, and hardware-friendly compression for speech foundation models, outperforming magnitude-based pruning.

Principles

Merging similar parameters preserves collective information.
Higher parameter variance indicates more complex information.
Structured compression is compatible with general-purpose hardware.
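
One way to read the variance principle is that layers whose weights vary more should keep more clusters. The sketch below illustrates that idea only. The helper name `allocate_clusters`, the proportional rule, and the toy layer shapes are assumptions made for illustration and are not taken from the paper, which may use a different allocation rule.

```python
import numpy as np

def allocate_clusters(layer_weights, target_sparsity, min_keep=1):
    """Hypothetical helper: pick a per-layer cluster count K from weight variance.

    layer_weights: list of 2-D arrays, one row per structured unit.
    target_sparsity: overall fraction of units to remove (0..1).
    """
    units = np.array([w.shape[0] for w in layer_weights])
    variances = np.array([w.var() for w in layer_weights])
    budget = int(round((1.0 - target_sparsity) * units.sum()))
    # Assumed rule: split the budget in proportion to each layer's variance.
    raw = budget * variances / variances.sum()
    # Clip to [min_keep, units]; clipping can leave part of the budget unused.
    k = np.clip(np.floor(raw).astype(int), min_keep, units)
    return k, budget

# Toy layers with increasing weight scale (synthetic, not real model weights).
rng = np.random.default_rng(0)
layers = [rng.normal(scale=s, size=(100, 16)) for s in (0.8, 1.0, 1.2)]
k_per_layer, budget = allocate_clusters(layers, target_sparsity=0.5)
print(k_per_layer, budget)
```

In this toy setup the layer with the largest weight scale receives the most clusters. A fuller implementation would likely redistribute any budget lost to clipping.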

Method

Apply k-means clustering to structured units (attention heads, FFN units) to merge similar components into K centroids, replacing originals. Use variance-based mixed sparsity to adaptively assign K per layer.
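
The merging step can be sketched for a single feed-forward block as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the choice to cluster on each unit's incoming weights and bias, the rule of summing outgoing weights within a cluster, the ReLU activation, and the farthest-point k-means initialisation are all hypothetical choices made here to keep the example small and deterministic.

```python
import numpy as np

def kmeans(x, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [x[0]]
    dist = np.sum((x - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        centers.append(x[np.argmax(dist)])
        dist = np.minimum(dist, np.sum((x - centers[-1]) ** 2, axis=1))
    centers = np.stack(centers).astype(float)

    def assign(c):
        # Squared distance from every row of x to every centre.
        return ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        new = centers.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster is empty
                new[j] = x[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign(centers)

def merge_ffn_units(w_in, b_in, w_out, k):
    """Merge d_ff hidden units into k. Shapes: w_in (d_ff, d), b_in (d_ff,), w_out (d, d_ff)."""
    feats = np.concatenate([w_in, b_in[:, None]], axis=1)
    centers, labels = kmeans(feats, k)
    # Incoming weights and bias become the centroid; outgoing weights are summed
    # (assumption), which is exact when the merged units are identical.
    new_w_out = np.stack([w_out[:, labels == j].sum(axis=1) for j in range(k)], axis=1)
    return centers[:, :-1], centers[:, -1], new_w_out

def ffn(x, w_in, b_in, w_out):
    return np.maximum(x @ w_in.T + b_in, 0.0) @ w_out.T

# Toy FFN with 8 hidden units built as 4 duplicated pairs (synthetic weights).
rng = np.random.default_rng(0)
w_in = np.repeat(rng.normal(size=(4, 8)), 2, axis=0)
b_in = np.repeat(rng.normal(size=4), 2)
w_out = rng.normal(size=(8, 8))
x = rng.normal(size=(5, 8))

m_in, m_b, m_out = merge_ffn_units(w_in, b_in, w_out, k=4)
print(np.allclose(ffn(x, w_in, b_in, w_out), ffn(x, m_in, m_b, m_out)))
```

The toy weights contain exact duplicates so that halving the hidden width leaves the output unchanged. With real weights the units in a cluster are only similar, so the merged block approximates the original. Because whole units are removed, the result is a smaller dense layer rather than a sparse matrix.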

In practice

Compress HuBERT-large or Whisper-large-v3 without data or training.
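Achieve lower WER than magnitude-based pruning at the same sparsity, for example 27.73% absolute on test-clean for HuBERT-large at 50% sparsity before fine-tuning.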
Deploy compressed models on standard hardware platforms.

Topics

Speech Foundation Models
Model Compression
Parameter Clustering
Automatic Speech Recognition
HuBERT
Whisper

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.