BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, quick

Summary

BASENet is a speech enhancement network whose frequency-adapted architecture reflects the non-uniform spectral resolution of human hearing. It partitions the audio spectrum into Bark-scale bands and gives each band an encoder whose capacity scales with critical-band density: deeper processing for the perceptually dense low frequencies, lighter processing for the high frequencies. A cross-band attention module captures harmonic dependencies across bands from compact frequency-pooled representations and operates with linear complexity. Built on inverted residual blocks, dense connectivity, and a convolutional recurrent network, BASENet achieves a PESQ of 3.55 and a STOI of about 96% on the VoiceBank+DEMAND dataset. It does so with only 0.83M parameters and 7.3 GMACs, making it the most parameter-efficient method among those exceeding 3.50 PESQ. A causal variant reaches 3.44 PESQ, which supports its suitability for real-time streaming on resource-constrained devices.
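To make the band partitioning concrete, here is a minimal sketch of splitting STFT bins into bands of equal Bark width. It is an illustration under stated assumptions, not the paper's implementation: the Zwicker-Terhardt Bark approximation, the equal-Bark-width rule, and the names `hz_to_bark`, `bark_band_edges`, and `band_density` are all choices made for this example, and the summary does not say how BASENet picks band edges or maps density to encoder depth.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale (assumed here)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bin_barks(n_bins: int, sample_rate: int) -> list[float]:
    """Bark value of each STFT bin centre, from 0 Hz up to Nyquist."""
    nyquist = sample_rate / 2.0
    return [hz_to_bark(k * nyquist / (n_bins - 1)) for k in range(n_bins)]


def bark_band_edges(n_bins: int, sample_rate: int, n_bands: int) -> list[int]:
    """Return n_bands + 1 bin indices that split the bins into equal-Bark bands."""
    barks = bin_barks(n_bins, sample_rate)
    edges = [0]
    for b in range(1, n_bands):
        target = barks[-1] * b / n_bands
        # First bin whose Bark value reaches the target boundary.
        idx = next(k for k, z in enumerate(barks) if z >= target)
        edges.append(max(idx, edges[-1] + 1))  # keep at least one bin per band
    edges.append(n_bins)
    return edges


def band_density(edges: list[int], sample_rate: int, n_bins: int) -> list[float]:
    """Bark covered per bin in each band; higher means perceptually denser."""
    barks = bin_barks(n_bins, sample_rate)
    dens = []
    for lo, hi in zip(edges, edges[1:]):
        span = barks[min(hi, n_bins - 1)] - barks[lo]
        dens.append(span / (hi - lo))
    return dens


if __name__ == "__main__":
    edges = bark_band_edges(n_bins=257, sample_rate=16000, n_bands=4)
    print("band edges (bin indices):", edges)
    print("Bark per bin, by band:", band_density(edges, 16000, 257))
```

With this rule the low-frequency bands contain few bins but cover as much of the Bark scale as the wide high-frequency bands, so their Bark-per-bin density is highest. A capacity rule in the spirit of the summary could give those dense bands deeper encoders, though the exact mapping used by BASENet is not stated here.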

Key takeaway

For machine learning engineers building real-time speech enhancement systems, BASENet offers a highly efficient architecture. Its frequency-adapted design and cross-band attention are worth considering when optimizing for resource-constrained devices. The causal variant reaches 3.44 PESQ, and the 0.83M-parameter footprint reported for BASENet makes it a strong reference point for balancing output quality against computational cost in edge deployments.

Key insights

BASENet enhances speech by adapting network capacity to Bark-scale frequency bands and using cross-band attention for efficient, perceptually aware processing.

Method

BASENet partitions the spectrum into Bark-scale bands, assigning scaled-capacity encoders. It uses a cross-band attention module with frequency-pooled representations and is built on inverted residual blocks and a convolutional recurrent network.
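The cross-band attention idea can be sketched for a single time frame as follows. This is a simplified illustration, not the published module: it assumes mean pooling over frequency, uses the pooled tokens directly as queries, keys, and values (no learned projections), and adds the attended context back to every bin of the band. The function names are hypothetical. Under this reading, attention runs over one token per band, so with a fixed number of bands the cost grows linearly with the number of bins and frames; the summary does not specify which quantity the "linear complexity" claim refers to.

```python
import math

Vector = list[float]


def pool_band(band: list[Vector]) -> Vector:
    """Average a band's per-bin feature vectors into one compact token."""
    n = len(band)
    return [sum(col) / n for col in zip(*band)]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_band_attention(bands: list[list[Vector]]) -> list[list[Vector]]:
    """Let each band attend to all bands via frequency-pooled tokens.

    bands[b][k] is the feature vector of bin k in band b for one frame.
    Bands may contain different numbers of bins.
    """
    tokens = [pool_band(band) for band in bands]  # one token per band
    dim = len(tokens[0])
    out = []
    for band, query in zip(bands, tokens):
        # Scaled dot-product scores between this band and every band.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted sum of band tokens gives the cross-band context.
        context = [
            sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(dim)
        ]
        # Residual add: broadcast the context to every bin in the band.
        out.append([[x + c for x, c in zip(vec, context)] for vec in band])
    return out


if __name__ == "__main__":
    frame = [
        [[0.5, 0.1], [0.4, 0.2]],              # narrow low-frequency band
        [[0.1, 0.3], [0.0, 0.2], [0.2, 0.1]],  # wider high-frequency band
    ]
    for band in cross_band_attention(frame):
        print(band)
```

Because each band is reduced to a single token before attention, the attention matrix has one row and column per band rather than per frequency bin, which is what keeps the module compact in this sketch.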

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.