BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention
Summary
BASENet is a speech enhancement network whose frequency-adapted architecture addresses the non-uniform spectral resolution of human hearing. It partitions the audio spectrum into Bark-scale bands and gives each band an encoder whose capacity scales with critical-band density, so perceptually dense low frequencies receive deeper processing and high frequencies receive lighter processing. A cross-band attention module captures harmonic dependencies across bands using compact frequency-pooled representations and operates with linear complexity. Built on inverted residual blocks, dense connectivity, and a convolutional recurrent network, BASENet achieves a PESQ of 3.55 and a STOI of about 96% on the VoiceBank+DEMAND dataset. It uses only 0.83M parameters and 7.3G MACs, making it the most parameter-efficient method among those exceeding 3.50 PESQ. A causal variant reaches 3.44 PESQ, which further demonstrates its suitability for real-time streaming on resource-constrained devices.
Key takeaway
For machine learning engineers building real-time speech enhancement systems, BASENet offers a highly efficient architecture. Consider its frequency-adapted design and cross-band attention when optimizing for resource-constrained devices. The full model reaches 3.55 PESQ with only 0.83M parameters, and its causal variant reaches 3.44 PESQ, which makes it a strong reference point for balancing output quality against computational cost in edge deployments.
Key insights
BASENet enhances speech by adapting network capacity to Bark-scale frequency bands and using cross-band attention for efficient, perceptually aware processing.
Principles
- Human hearing has non-uniform spectral resolution.
- Adapt network capacity to perceptual frequency density.
- Cross-band attention captures harmonic dependencies.
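The first two principles can be illustrated with a minimal sketch. It assumes the Traunmüller approximation of the Bark scale, bands of equal Bark width, and a simple depth rule; none of these choices are stated in the article, and the function names, the 16 kHz sample rate, the 512-point FFT, and the band count are hypothetical values picked for illustration.

```python
import math


def hz_to_bark(f_hz):
    """Traunmueller approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_band_edges(sample_rate, n_fft, n_bands):
    """Split STFT bins into n_bands bands of equal Bark width.

    Returns n_bands + 1 bin indices; band b covers bins
    [edges[b], edges[b + 1]).
    """
    nyquist = sample_rate / 2.0
    n_bins = n_fft // 2 + 1
    z_lo, z_hi = hz_to_bark(0.0), hz_to_bark(nyquist)
    edges = []
    for b in range(n_bands + 1):
        z = z_lo + (z_hi - z_lo) * b / n_bands
        edges.append(round(bark_to_hz(z) / nyquist * (n_bins - 1)))
    edges[-1] = n_bins  # the last band also takes the Nyquist bin
    return edges


def encoder_depths(edges, max_depth=4):
    """Hypothetical rule: depth proportional to Bark-per-bin density.

    Every band spans the same Bark width, so density is proportional
    to 1 / (number of bins in the band).
    """
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [1.0 / w for w in widths]
    peak = max(densities)
    return [max(1, round(max_depth * d / peak)) for d in densities]


if __name__ == "__main__":
    edges = bark_band_edges(sample_rate=16000, n_fft=512, n_bands=8)
    print("bin edges:", edges)
    print("depths:   ", encoder_depths(edges))
```

Under these assumptions the low-frequency bands cover few STFT bins each and get the deepest encoders, while the top band covers many bins and gets the shallowest one.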
Method
BASENet partitions the spectrum into Bark-scale bands and assigns each band an encoder whose capacity scales with critical-band density. A cross-band attention module operates on frequency-pooled band representations, and the network is built on inverted residual blocks, dense connectivity, and a convolutional recurrent network.
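The attention step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the article's implementation: mean pooling over frequency, single-head scaled dot-product attention across bands within each frame, and a residual broadcast back to every bin are all assumed, and the names `cross_band_attention`, `w_q`, `w_k`, and `w_v` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_band_attention(band_feats, w_q, w_k, w_v):
    """Attend across bands using one pooled token per band and frame.

    band_feats: list of B arrays, each (T, F_b, C); F_b may differ per band.
    w_q, w_k:   (C, D) projection matrices (hypothetical parameters).
    w_v:        (C, C) projection so the result can be added residually.
    Returns a list of B arrays with the same shapes as the inputs.
    """
    # Frequency pooling (assumed to be a mean): (T, B, C) tokens.
    tokens = np.stack([f.mean(axis=1) for f in band_feats], axis=1)

    q = tokens @ w_q  # (T, B, D)
    k = tokens @ w_k  # (T, B, D)
    v = tokens @ w_v  # (T, B, C)

    # One B x B attention map per frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    context = softmax(scores, axis=-1) @ v  # (T, B, C)

    # Broadcast each band's context over its frequency bins.
    return [f + context[:, b, None, :] for b, f in enumerate(band_feats)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C, D = 5, 4, 3
    feats = [rng.standard_normal((T, F, C)) for F in (7, 9, 20)]
    w_q = rng.standard_normal((C, D))
    w_k = rng.standard_normal((C, D))
    w_v = rng.standard_normal((C, C))
    out = cross_band_attention(feats, w_q, w_k, w_v)
    print([o.shape for o in out])
```

In this sketch the attention maps are B x B per frame, so the attention cost grows linearly with the number of frames and does not depend on the number of frequency bins once pooling is done. Whether that is the sense of "linear complexity" intended in the article is an assumption.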
In practice
- Deploy causal variant for real-time streaming.
- Optimize speech enhancement for resource-constrained devices.
- Improve PESQ/STOI with minimal parameters.
Topics
- Speech Enhancement
- Neural Networks
- Bark Scale
- Cross-Band Attention
- Real-time Audio Processing
- Resource-Constrained Devices
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.