Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data
Summary
Researchers increasingly use text classification, including supervised models and large language models, to measure constructs from natural language, often reporting recall and precision as validity metrics. However, measures of uncertainty such as confidence intervals are inconsistently reported, or are estimated with methods that are inappropriate for small datasets or high-performance scenarios. This paper evaluates confidence interval methods under conditions typical of social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals. Simulations show that default methods such as the Wald interval and the basic percentile bootstrap are inaccurate. Accuracy improves with the Agresti-Coull, Wilson, and Clopper-Pearson intervals and with a novel pseudo-count regularized bootstrap, the last particularly for F1 scores. For nested texts, accurate analytic intervals require adjusting for effective N and using appropriate degrees of freedom. The hierarchical bootstrap is more accurate than the cluster bootstrap when each individual contributes a moderate number of texts, but it is overly conservative when each individual contributes only a few.
Key takeaway
If you are a Research Scientist or Machine Learning Engineer reporting classifier performance, particularly with small datasets, infrequent constructs, or nested text data, move beyond default confidence interval methods. Adopt the Agresti-Coull, Wilson, or Clopper-Pearson interval, use the pseudo-count regularized bootstrap for F1, and adjust for effective N and degrees of freedom when the data are nested. Doing so makes the reported uncertainty more transparent and the validation of your machine learning applications more robust.
Key insights
Accurate confidence interval reporting for classifier performance metrics, especially with small or nested data, requires specific statistical methods beyond defaults.
Principles
- Default confidence interval methods are often inaccurate for small datasets or high-performing classifiers.
- Nested data requires adjustment for effective N and appropriate degrees of freedom.
- Validation sample size is critical at the design stage for robust results.
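To make the effective-N principle concrete, here is a minimal Python sketch. It assumes the standard Kish design-effect approximation (equal texts per individual and a known intraclass correlation), which may differ from the exact adjustment the paper uses. The function names, the example counts, and the ICC value are hypothetical. The critical value is passed in as `crit` so that a t quantile with suitable degrees of freedom could be supplied; the sketch itself only uses a normal quantile and does not implement the degrees-of-freedom adjustment.

```python
from math import sqrt
from statistics import NormalDist


def effective_n(n_texts, n_individuals, icc):
    """Kish approximation: shrink n by the design effect 1 + (m - 1) * ICC.

    n_texts should be the denominator of the metric (for recall, the number
    of truly positive texts), and m is the average number of such texts per
    individual. Assumes roughly equal cluster sizes.
    """
    m = n_texts / n_individuals
    design_effect = 1 + (m - 1) * icc
    return n_texts / design_effect


def wilson_from_proportion(p_hat, n, crit):
    """Wilson score interval for a proportion with a (possibly fractional) n."""
    denom = 1 + crit ** 2 / n
    center = (p_hat + crit ** 2 / (2 * n)) / denom
    half = crit * sqrt(p_hat * (1 - p_hat) / n + crit ** 2 / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 200 positive texts from 40 individuals, ICC = 0.3.
z = NormalDist().inv_cdf(0.975)  # normal critical value; a t quantile could replace it
n_eff = effective_n(200, 40, 0.3)
naive = wilson_from_proportion(0.9, 200, z)
adjusted = wilson_from_proportion(0.9, n_eff, z)
print(f"effective N: {n_eff:.1f}")
print(f"naive interval:    {naive[0]:.3f} to {naive[1]:.3f}")
print(f"adjusted interval: {adjusted[0]:.3f} to {adjusted[1]:.3f}")
```

With a positive ICC the effective N is smaller than the raw count, so the adjusted interval is wider than the naive one that treats all texts as independent.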
Method
The paper evaluates confidence interval methods for performance metrics under conditions typical of social science text classification, including small samples, infrequent constructs, and texts nested within individuals.
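The kind of evaluation described here can be illustrated with a small Python sketch. Rather than reproducing the paper's simulations, it computes the exact coverage of a nominal 95% interval for a single binomial proportion by summing binomial probabilities over every possible count. The scenario (30 validation texts, true recall of 0.95) and all function names are assumptions chosen for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value


def wald(k, n):
    """Default normal-approximation interval."""
    p = k / n
    half = Z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson(k, n):
    """Wilson score interval."""
    p = k / n
    denom = 1 + Z ** 2 / n
    center = (p + Z ** 2 / (2 * n)) / denom
    half = Z * sqrt(p * (1 - p) / n + Z ** 2 / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def exact_coverage(interval, n, p_true):
    """Probability, over k ~ Binomial(n, p_true), that the interval contains p_true."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p_true <= hi:
            total += comb(n, k) * p_true ** k * (1 - p_true) ** (n - k)
    return total


# Hypothetical scenario: small validation set, high-performing classifier.
wald_cov = exact_coverage(wald, 30, 0.95)
wilson_cov = exact_coverage(wilson, 30, 0.95)
print(f"Wald coverage:   {wald_cov:.3f}")
print(f"Wilson coverage: {wilson_cov:.3f}")
```

In this scenario the Wald interval covers the true value only about 78% of the time, largely because it collapses to a single point whenever every text is classified correctly, while the Wilson interval comes out at roughly 94%.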
In practice
- Use Agresti-Coull, Wilson, or Clopper-Pearson for improved confidence interval accuracy.
- Apply the paper's pseudo-count regularized bootstrap for F1 score confidence intervals.
- Adjust for effective N and degrees of freedom when analyzing nested text data.
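The three analytic intervals named above can be sketched in plain Python using their textbook definitions. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the Clopper-Pearson bounds are found by bisection on the binomial tail probabilities instead of a beta quantile function, and the example count of 28 correct out of 30 is made up. The pseudo-count regularized bootstrap is not shown because its details are not given in this summary.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson(k, n, conf=0.95):
    """Wilson score interval for k successes out of n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def agresti_coull(k, n, conf=0.95):
    """Agresti-Coull interval: Wald formula on counts padded with z^2 pseudo-trials."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    n_adj = n + z * z
    p_adj = (k + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target):
    """Bisection on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) interval via binomial tail inversion."""
    alpha = 1 - conf
    # Lower bound: p where P(X >= k) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - _binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: p where P(X <= k) = alpha / 2, i.e. P(X > k) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - _binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Hypothetical example: 28 of 30 positive texts correctly identified (recall).
for name, fn in [("Wilson", wilson), ("Agresti-Coull", agresti_coull),
                 ("Clopper-Pearson", clopper_pearson)]:
    lo, hi = fn(28, 30)
    print(f"{name}: {lo:.3f} to {hi:.3f}")
```

For precision or recall, `n` is the metric's own denominator (predicted positives or actual positives), not the total number of texts, which is why infrequent constructs leave so little data for the interval.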
Topics
- Classifier Performance
- Uncertainty Estimation
- Confidence Intervals
- Large Language Models
- Nested Data
- Text Classification
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.