The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study analyzed how prompt design and model selection affect the accuracy of LLM-predicted fan experience ratings from open-ended survey text. Building on prior work (Hong, Potteiger, and Zapata 2026), which showed that an unoptimized GPT 4.1 prompt predicted ratings within one point 67% of the time, this research tested four configurations across 10,000 post-game surveys from five MLB teams. The comparison covered an original baseline prompt, a moderately customized version, and three GPT models (4.1, 4.1-mini, 5.2). The customized prompt improved GPT 4.1's accuracy by two percentage points, to 69%. However, GPT 5.2 reverted to baseline performance, and GPT 4.1-mini dropped six percentage points. The study found that the linguistic character of the input text had a significantly greater impact on accuracy than either prompt or model choice. This points to a two-part ceiling: model bias that prompt design can correct, and information missing from the text itself that no amount of engineering can recover.

Key takeaway

For AI engineers developing LLM-based sentiment analysis from open-ended text, focus your efforts on understanding and improving the quality and linguistic consistency of your input data. Prompt customization can address specific model biases and yield modest gains (here, two percentage points), but model upgrades or swaps may not reliably enhance performance and can even degrade it. Your primary ceiling is often the inherent information content of the text itself, not the model or prompt.

Key insights

Input text characteristics limit LLM prediction accuracy more than prompt or model choice.

Principles

Prompt design corrects model bias.
Model selection does not reliably improve accuracy.
Linguistic character of text dictates accuracy.

Method

Compared baseline and customized prompts with GPT 4.1, 4.1-mini, and 5.2 on 10,000 MLB post-game surveys, measuring how often the predicted experience rating fell within ±1 point of the rating the fan actually gave.
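
As a rough illustration of the within ±1 agreement measure, the following minimal sketch computes the share of predictions that land within one point of the actual rating. The function name `within_one_agreement`, the `tolerance` parameter, and the sample ratings are assumptions made for illustration; they are not taken from the study.

```python
from typing import Sequence


def within_one_agreement(
    actual: Sequence[int],
    predicted: Sequence[int],
    tolerance: int = 1,
) -> float:
    """Return the fraction of predictions within `tolerance` points of the actual rating."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one rating pair is required")
    # A prediction counts as a hit when it is at most `tolerance` points off.
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical ratings, for illustration only (not data from the study).
actual_ratings = [10, 7, 3, 8, 5]
predicted_ratings = [9, 7, 6, 10, 5]
agreement = within_one_agreement(actual_ratings, predicted_ratings)
print(f"within +/-1 agreement: {agreement:.0%}")
```

A tolerance-based measure like this treats a one-point miss the same as an exact match, so it says nothing about how far off the remaining predictions are.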

In practice

Prioritize input data quality.
Customize prompts for specific biases.
Evaluate model swaps carefully.
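
One way to evaluate a model swap carefully is to score both configurations on the same surveys and look at where they disagree, not only at the headline accuracy. The sketch below assumes within ±1 agreement as the metric; the function name `compare_configs`, the returned keys, and the sample ratings are hypothetical and are not drawn from the study.

```python
from typing import Dict, Sequence


def compare_configs(
    actual: Sequence[int],
    preds_a: Sequence[int],
    preds_b: Sequence[int],
    tolerance: int = 1,
) -> Dict[str, float]:
    """Compare two prompt/model configurations scored on the same surveys."""
    n = len(actual)
    if n == 0 or len(preds_a) != n or len(preds_b) != n:
        raise ValueError("all inputs must be non-empty and of equal length")
    # Per-survey hit flags: True when the prediction is within `tolerance` points.
    hit_a = [abs(y - p) <= tolerance for y, p in zip(actual, preds_a)]
    hit_b = [abs(y - p) <= tolerance for y, p in zip(actual, preds_b)]
    return {
        "acc_a": sum(hit_a) / n,
        "acc_b": sum(hit_b) / n,
        # Surveys where exactly one configuration was within tolerance.
        "only_a": sum(a and not b for a, b in zip(hit_a, hit_b)),
        "only_b": sum(b and not a for a, b in zip(hit_a, hit_b)),
    }


# Hypothetical ratings, for illustration only (not data from the study).
actual_ratings = [8, 6, 9, 4, 7, 10]
current_model = [7, 6, 6, 4, 9, 10]
candidate_model = [8, 3, 8, 5, 7, 10]
result = compare_configs(actual_ratings, current_model, candidate_model)
print(result)
```

If the two configurations have similar accuracy but many surveys fall into `only_a` or `only_b`, the swap changes which responses are predicted well, which is worth inspecting before adopting it.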

Topics

LLM Performance Limits
Prompt Engineering
Model Selection
Experience Rating Prediction
Open-ended Survey Analysis

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.