A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Source: cs.CL updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

This paper introduces a calculus-based framework for analytically determining the optimal vocabulary size in end-to-end Automatic Speech Recognition (ASR) systems, moving beyond heuristic choices. Unlike hybrid ASR, end-to-end systems derive vocabulary from text corpora, making vocabulary size a critical hyper-parameter. The authors extend previous work by curve-fitting training data statistics, specifically token imbalance $\Delta(n)$ and total tokens $\Theta(n)$, and applying first and second derivative tests from calculus to estimate the optimal size. They model these corpus-dependent cost components as differentiable functions, first using second-order polynomials and then an improved polynomial-exponential model. Experimental validation on the LibriSpeech-100 corpus with a Conformer-based ASR model demonstrates that the analytically derived optimal vocabulary size, such as $n^*=61$ with the polynomial-exponential fit, yields competitive or improved Word Error Rates (WERs) compared to the commonly used heuristic $n=300$ (e.g., 13.60% vs. 14.55% test-avg WER).
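To put the quoted WER gap in perspective, here is a back-of-envelope calculation. It is not a figure quoted from the paper, and it assumes that 13.60% is the test-avg WER at $n^*=61$ and 14.55% the test-avg WER at $n=300$:

$$
14.55 - 13.60 = 0.95 \text{ points (absolute)}, \qquad \frac{14.55 - 13.60}{14.55} \approx 0.065 \text{ (about 6.5\% relative)}
$$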

Key takeaway

Research scientists optimizing end-to-end ASR systems can determine vocabulary size analytically instead of relying on undocumented heuristics. Modeling corpus statistics such as token imbalance $\Delta(n)$ and total tokens $\Theta(n)$ with differentiable functions makes it possible to solve for an optimal vocabulary size directly; on LibriSpeech-100, the derived $n^*=61$ reached a lower test-avg WER than the common heuristic $n=300$ (13.60% vs. 14.55%). The approach is explainable and offers a principled alternative to empirical grid search over vocabulary sizes.

Key insights

A calculus-based framework uses differentiable cost functions to choose the ASR vocabulary size analytically, yielding competitive or improved WERs relative to the common heuristic choice on LibriSpeech-100.

Principles

Method

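The method models the corpus-derived cost components $\Delta(n)$ (token imbalance) and $\Theta(n)$ (total tokens) as smooth, differentiable functions of the vocabulary size $n$, first with second-order polynomials and then with an improved polynomial-exponential form. It then applies the first and second derivative tests to solve for the optimal vocabulary size $n^*$.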
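The fit-then-differentiate procedure can be sketched in Python for the second-order polynomial case. This is a minimal illustration, not the authors' implementation. The summary does not say how $\Delta(n)$ and $\Theta(n)$ are combined into a single cost, so the weighted sum below, the `weight` parameter, the function name `optimal_vocab_size`, and the toy data are all assumptions. The polynomial-exponential variant is omitted because its exact form is not given here.

```python
import numpy as np


def optimal_vocab_size(n, delta, theta, weight=1.0):
    """Return stationary points of the fitted cost that pass the second derivative test.

    Assumptions (not taken from the paper): the cost is
    delta_fit(n) + weight * theta_fit(n), and both components are
    fitted with second-order polynomials.
    """
    n = np.asarray(n, dtype=float)

    # Least-squares quadratic fits to the measured corpus statistics.
    delta_fit = np.poly1d(np.polyfit(n, delta, 2))
    theta_fit = np.poly1d(np.polyfit(n, theta, 2))

    # Hypothetical combined cost and its first two derivatives.
    cost = delta_fit + weight * theta_fit
    d1 = cost.deriv(1)
    d2 = cost.deriv(2)

    # First derivative test: candidates are the real roots of cost'(n) = 0.
    roots = np.atleast_1d(d1.roots)
    real_roots = roots[np.isreal(roots)].real

    # Second derivative test: keep only candidates where cost''(n) > 0 (minima).
    return [float(r) for r in real_roots if d2(r) > 0]


if __name__ == "__main__":
    # Toy statistics, invented for illustration only.
    sizes = np.array([50, 100, 200, 300, 400, 500])
    toy_delta = 0.002 * sizes                # imbalance grows with n
    toy_theta = 1e-5 * (sizes - 300) ** 2    # token-count term, convex in n
    print(optimal_vocab_size(sizes, toy_delta, toy_theta))
```

The toy data are constructed so that the combined cost has its minimum near $n = 200$. With real corpus statistics, a candidate $n^*$ would also need rounding to an integer and a check that it lies within the range of vocabulary sizes actually measured.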

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.