Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation) · Depth: Expert, quick

Summary

Researchers increasingly use text classification, including supervised models and large language models, to measure constructs from natural language, and they often report recall and precision as validity metrics. However, measures of uncertainty such as confidence intervals are reported inconsistently, or are estimated with methods that are inappropriate for small datasets or for performance close to the ceiling. This paper evaluates confidence interval methods under conditions typical of social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals. Simulations show that default methods such as the Wald interval and the basic percentile bootstrap are inaccurate. Accuracy improves with the Agresti-Coull, Wilson, and Clopper-Pearson intervals and with a novel pseudo-count regularized bootstrap, particularly for F1 scores. For nested texts, accurate analytic intervals require adjusting for the effective N and using appropriate degrees of freedom. The hierarchical bootstrap is more accurate than the cluster bootstrap when each individual contributes a moderate number of texts, but it is overly conservative when each individual contributes only a few.
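
The analytic intervals named above can be compared on a single recall estimate. The following is a minimal sketch using the textbook formulas for each interval, not code from the paper; the function names and the example counts (28 of 30 true positives recovered) are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def _z(conf):
    # Two-sided standard normal critical value, e.g. about 1.96 for conf=0.95.
    return NormalDist().inv_cdf(0.5 + conf / 2)


def wald(k, n, conf=0.95):
    # Normal approximation around the raw proportion; collapses to zero width at k == n.
    p = k / n
    half = _z(conf) * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson(k, n, conf=0.95):
    # Score interval: the centre is pulled toward 0.5, so bounds stay inside [0, 1].
    z, p = _z(conf), k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def agresti_coull(k, n, conf=0.95):
    # Wald-style interval computed after adding z^2 pseudo-observations.
    z = _z(conf)
    n_adj = n + z * z
    p_adj = (k + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def _binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p).
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root(g, iters=60):
    # Bisection on [0, 1] for a function g that increases from negative to positive.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, conf=0.95):
    # "Exact" interval obtained by inverting the binomial tail probabilities.
    a = (1 - conf) / 2
    lower = 0.0 if k == 0 else _root(lambda p: (1 - _binom_cdf(k - 1, n, p)) - a)
    upper = 1.0 if k == n else _root(lambda p: a - _binom_cdf(k, n, p))
    return lower, upper


if __name__ == "__main__":
    k, n = 28, 30  # hypothetical: 28 of 30 positive texts correctly identified
    for name, fn in [("Wald", wald), ("Wilson", wilson),
                     ("Agresti-Coull", agresti_coull),
                     ("Clopper-Pearson", clopper_pearson)]:
        lo, hi = fn(k, n)
        print(f"{name:16s} [{lo:.3f}, {hi:.3f}]")
```

With counts this close to the ceiling, the Wald upper bound has to be clipped at 1, and at perfect recall (k equal to n) the Wald interval has zero width, which is one way to see why it performs poorly in the settings the summary describes.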

Key takeaway

Research scientists and machine learning engineers who report classifier performance, particularly with small datasets, infrequent constructs, or nested text data, should move beyond default confidence interval methods. Use the Agresti-Coull, Wilson, or Clopper-Pearson interval, or the pseudo-count regularized bootstrap for F1, and adjust for effective N and degrees of freedom when data are nested. Doing so makes the reported uncertainty more accurate and the validation of the classifier more transparent.
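
To illustrate why a plain percentile bootstrap can fail for F1, and what pseudo-count regularization might look like, here is a hedged sketch. This summary does not describe how the paper applies the pseudo-count, so the code assumes one plausible reading: a small constant (`pseudo`, here 0.5) is added to the true positive, false positive, and false negative counts in every resample. The function names and the default values are hypothetical.

```python
import random


def f1_from_counts(tp, fp, fn, pseudo=0.0):
    # Assumption: the pseudo-count is added to each of TP, FP and FN.
    tp, fp, fn = tp + pseudo, fp + pseudo, fn + pseudo
    denom = 2 * tp + fp + fn
    # With pseudo == 0 and no positives at all, F1 is undefined; return 0.0 by convention.
    return 2 * tp / denom if denom > 0 else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, pseudo=0.5, conf=0.95, seed=0):
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    stats = []
    for _ in range(n_boot):
        # Resample (label, prediction) pairs with replacement.
        sample = rng.choices(pairs, k=n)
        tp = sum(1 for t, p in sample if t == 1 and p == 1)
        fp = sum(1 for t, p in sample if t == 0 and p == 1)
        fn = sum(1 for t, p in sample if t == 1 and p == 0)
        stats.append(f1_from_counts(tp, fp, fn, pseudo))
    stats.sort()
    alpha = 1 - conf
    # Percentile interval over the (regularized) bootstrap distribution.
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical data: 100 texts, 10 positives, every text classified correctly.
    y_true = [1] * 10 + [0] * 90
    y_pred = list(y_true)
    print("plain      ", bootstrap_f1_ci(y_true, y_pred, pseudo=0.0))
    print("regularized", bootstrap_f1_ci(y_true, y_pred, pseudo=0.5))
```

When the classifier makes no errors on a small validation set, every unregularized resample also has no errors, so the plain percentile interval collapses to a single point. Under the assumption above, the pseudo-count keeps each resampled F1 strictly below 1, so the interval retains some width.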

Key insights

Accurate confidence intervals for classifier performance metrics, especially with small or nested data, require methods other than the default Wald interval and basic percentile bootstrap.

Principles

Method

The paper evaluates confidence interval methods for performance metrics under conditions typical of social science text classification, including small samples, infrequent constructs, and texts nested within individuals.
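
For the nested case, a common way to express the loss of information from clustering is the Kish design effect, which shrinks the nominal number of texts to an effective N. The sketch below assumes equal numbers of texts per individual and an intraclass correlation (`icc`) supplied by the analyst. It is a standard survey-statistics approximation rather than the paper's procedure, and it omits the degrees-of-freedom adjustment because this summary does not specify it.

```python
from math import sqrt
from statistics import NormalDist


def effective_n(n_texts, n_individuals, icc):
    # Kish design effect, assuming roughly equal cluster sizes.
    m = n_texts / n_individuals           # average texts per individual
    design_effect = 1 + (m - 1) * icc     # equals 1 when texts are independent
    return n_texts / design_effect


def wilson_from_proportion(p_hat, n, conf=0.95):
    # Wilson score interval written for a possibly non-integer sample size.
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical: 200 positive texts from 40 individuals, recall 0.90, ICC 0.3.
    n_eff = effective_n(200, 40, icc=0.3)
    print("effective N:", round(n_eff, 1))
    print("naive    ", wilson_from_proportion(0.90, 200))
    print("adjusted ", wilson_from_proportion(0.90, n_eff))
```

The adjusted interval is wider than the naive one whenever the ICC is positive, reflecting that texts from the same individual carry overlapping information.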

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.