A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
Summary
This paper introduces a calculus-based framework for analytically determining the optimal vocabulary size in end-to-end Automatic Speech Recognition (ASR) systems, replacing heuristic choices. Unlike hybrid ASR, end-to-end systems derive their vocabulary from text corpora, which makes vocabulary size a critical hyper-parameter. The authors extend previous work by curve-fitting two training-data statistics, token imbalance $\Delta(n)$ and total tokens $\Theta(n)$, as functions of vocabulary size $n$, and then applying the first and second derivative tests to estimate the optimal size $n^*$. They model these corpus-dependent cost components as differentiable functions, first with second-order polynomials and then with an improved polynomial-exponential model. Experiments on the LibriSpeech-100 corpus with a Conformer-based ASR model show that the analytically derived vocabulary size ($n^*=61$ with the polynomial-exponential fit) yields competitive or improved Word Error Rates (WERs) compared with the commonly used heuristic $n=300$ (e.g., 13.60% vs. 14.55% test-avg WER, respectively).
Key takeaway
If you are a research scientist optimizing end-to-end ASR systems, consider adopting a calculus-based approach to determine vocabulary size rather than relying on undocumented heuristics. By modeling corpus statistics such as token imbalance and total tokens with differentiable functions, you can analytically derive a vocabulary size that improves Word Error Rate, as demonstrated by $n^*=61$ outperforming $n=300$ on LibriSpeech-100. This principled method is explainable and avoids the need for empirical grid searches.
Key insights
A calculus-based framework determines ASR vocabulary size analytically from differentiable cost functions, matching or improving on the common heuristic choice in the reported LibriSpeech-100 experiments.
Principles
- Vocabulary size is a critical ASR hyper-parameter.
- Corpus statistics can be modeled as differentiable functions.
- Optimality conditions can be derived using calculus.
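As a worked illustration of the last principle, suppose both statistics are fitted with second-order polynomials and the total cost is their plain sum $J(n) = \Delta(n) + \Theta(n)$. The additive form and the coefficient names below are assumptions made for this sketch; the summary does not state how the paper combines the two terms.

$$
\Delta(n) = a_1 n^2 + b_1 n + c_1, \qquad \Theta(n) = a_2 n^2 + b_2 n + c_2
$$

$$
J'(n) = 2(a_1 + a_2)\,n + (b_1 + b_2) = 0 \quad\Longrightarrow\quad n^* = -\frac{b_1 + b_2}{2\,(a_1 + a_2)}
$$

$$
J''(n) = 2\,(a_1 + a_2) > 0
$$

The first derivative test locates the stationary point $n^*$, and the second derivative test confirms it is a minimum whenever $a_1 + a_2 > 0$.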
Method
The method models the corpus-derived cost components $\Delta(n)$ and $\Theta(n)$ as smooth, differentiable functions (second-order polynomials at first, then a polynomial-exponential form) and applies the first and second derivative tests to solve for the optimal vocabulary size $n^*$.
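The fit-then-differentiate procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses the simpler second-order polynomial fit, assumes the total cost is the unweighted sum of the two fitted curves, and runs on synthetic numbers rather than LibriSpeech statistics. The helper names `fit_quadratic` and `optimal_vocab_size` are hypothetical.

```python
import numpy as np

def fit_quadratic(sizes, values):
    """Least-squares fit values ~ a*n**2 + b*n + c; returns (a, b, c)."""
    a, b, c = np.polyfit(sizes, values, deg=2)
    return float(a), float(b), float(c)

def optimal_vocab_size(delta_fit, theta_fit):
    """Minimize the assumed cost J(n) = Delta(n) + Theta(n) for quadratic fits."""
    a = delta_fit[0] + theta_fit[0]
    b = delta_fit[1] + theta_fit[1]
    # Second derivative test: J''(n) = 2a must be positive for a minimum.
    if a <= 0:
        raise ValueError("J''(n) <= 0: the fitted cost has no interior minimum")
    # First derivative test: J'(n) = 2a*n + b = 0.
    return -b / (2.0 * a)

# Synthetic statistics with hypothetical shapes (not the paper's measurements).
sizes = np.array([20, 40, 60, 80, 100, 150, 200, 300], dtype=float)
delta = 2e-5 * sizes**2 + 1e-3 * sizes + 0.1   # stand-in for token imbalance
theta = 3e-5 * sizes**2 - 1e-2 * sizes + 1.5   # stand-in for total tokens

n_star = optimal_vocab_size(fit_quadratic(sizes, delta), fit_quadratic(sizes, theta))
print(round(n_star))
```

In a real setting the `delta` and `theta` arrays would be measured by training a tokenizer at each candidate size, and the continuous $n^*$ would be rounded to an integer vocabulary size.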
In practice
- Use polynomial-exponential models for $\Delta(n)$ and $\Theta(n)$.
- Normalize cost components for cross-dataset stability.
- Validate $n^*$ with Conformer-based ASR models.
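The first two points above can be sketched numerically. The example below is illustrative only: the exponential curve shapes are hypothetical stand-ins, min-max scaling is one assumed choice of normalization (the summary does not say which one the paper uses), and the unweighted sum of the two normalized terms is likewise an assumption. The derivative tests are applied with finite differences via `numpy.gradient` instead of a closed form.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale values to [0, 1] so costs with different units are comparable."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0.0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def local_minima(n_grid, cost):
    """Grid points where numerical J' turns from negative to non-negative and J'' > 0."""
    d1 = np.gradient(cost, n_grid)   # first derivative estimate
    d2 = np.gradient(d1, n_grid)     # second derivative estimate
    hits = [i for i in range(1, len(n_grid))
            if d1[i - 1] < 0.0 <= d1[i] and d2[i] > 0.0]
    return n_grid[np.array(hits, dtype=int)]

# Hypothetical smooth models on very different raw scales.
n_grid = np.arange(10, 501, dtype=float)
delta_raw = 5000.0 * (1.0 - np.exp(-n_grid / 100.0))   # stand-in for token imbalance
theta_raw = 4.0e6 + 2.0e6 * np.exp(-n_grid / 50.0)     # stand-in for total tokens

cost = min_max_normalize(delta_raw) + min_max_normalize(theta_raw)
candidates = local_minima(n_grid, cost)
print(candidates)
```

Without normalization the `theta_raw` term, which is several orders of magnitude larger, would dominate the sum and push the minimum to the edge of the grid. Any candidate found this way is only a starting point: it still has to be validated by training and evaluating an ASR model at that vocabulary size.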
Topics
- End-to-End ASR
- Vocabulary Size Optimization
- Calculus-Based Modeling
- Tokenization Algorithms
- Conformer Architecture
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.