The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Summary
This study examines how prompt design and model selection affect the accuracy of LLM-predicted fan experience ratings from open-ended survey text. Prior work (Hong, Potteiger, and Zapata 2026) showed that an unoptimized GPT 4.1 prompt predicted ratings within one point 67% of the time. Building on that result, the study tested four configurations across 10,000 post-game surveys from five MLB teams, comparing the original baseline prompt with a moderately customized version across three GPT models (4.1, 4.1-mini, 5.2). The customized prompt improved GPT 4.1's accuracy by two percentage points, to 69%. However, GPT 5.2 reverted to baseline performance, and GPT 4.1-mini dropped six percentage points. The linguistic character of the input text had a significantly greater impact on accuracy than either prompt or model choice, pointing to a two-part ceiling: model bias that prompt design can correct, and information missing from the text that engineering cannot recover.
Key takeaway
For AI Engineers developing LLM-based sentiment analysis from open-ended text: focus your efforts on understanding and improving the quality and linguistic consistency of your input data. Prompt customization can address specific model biases and yield modest gains (here, two percentage points), but model upgrades or swaps do not reliably improve performance and can even degrade it. Your primary ceiling is often the information content of the text itself, not the model or the prompt.
Key insights
Input text characteristics limit LLM prediction accuracy more than prompt or model choice.
Principles
- Prompt design corrects model bias.
- Model selection does not reliably improve accuracy.
- Linguistic character of text dictates accuracy.
Method
The study compared baseline and customized prompts across GPT 4.1, 4.1-mini, and 5.2 on 10,000 post-game surveys from five MLB teams, measuring how often the predicted experience rating fell within ±1 point of the rating the fan reported.
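The within ±1 agreement metric is simple to compute. The Python sketch below is illustrative, not the study's code: the function name `within_one_agreement` and the sample ratings are assumptions made up for the example, and the summary does not state the rating scale, so the values are treated as plain integers.

```python
from typing import Sequence


def within_one_agreement(
    predicted: Sequence[int],
    actual: Sequence[int],
    tolerance: int = 1,
) -> float:
    """Share of predictions that land within `tolerance` points of the actual rating."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one rating pair is required")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Made-up ratings for illustration only (not data from the study).
actual_ratings = [10, 8, 3, 7, 5]
predicted_ratings = [9, 8, 6, 9, 5]

rate = within_one_agreement(predicted_ratings, actual_ratings)
print(f"within ±1 agreement: {rate:.0%}")  # 3 of 5 pairs differ by at most 1 -> 60%
```

Exact-match accuracy is the same function with `tolerance=0`, which makes it easy to report both figures side by side.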
In practice
- Prioritize input data quality.
- Customize prompts for specific biases.
- Evaluate model swaps carefully.
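One way to evaluate a prompt or model swap carefully is to score both configurations on the same surveys and look at which individual predictions changed, not just the headline rate. The sketch below is a hypothetical helper, not the study's procedure: `compare_configs`, its return keys, and the sample ratings are all assumptions for illustration.

```python
from typing import Dict, List, Sequence


def _hits(predicted: Sequence[int], actual: Sequence[int], tolerance: int) -> List[bool]:
    """Per-survey flags: True where the prediction is within `tolerance` of the actual rating."""
    return [abs(p - a) <= tolerance for p, a in zip(predicted, actual)]


def compare_configs(
    actual: Sequence[int],
    baseline_pred: Sequence[int],
    candidate_pred: Sequence[int],
    tolerance: int = 1,
) -> Dict[str, float]:
    """Compare two configurations scored on the same set of surveys."""
    if not (len(actual) == len(baseline_pred) == len(candidate_pred)):
        raise ValueError("all rating sequences must have the same length")
    if not actual:
        raise ValueError("at least one survey is required")

    base = _hits(baseline_pred, actual, tolerance)
    cand = _hits(candidate_pred, actual, tolerance)
    n = len(actual)

    return {
        "baseline_pct": 100 * sum(base) / n,
        "candidate_pct": 100 * sum(cand) / n,
        # Difference in percentage points, not relative percent change.
        "delta_points": 100 * (sum(cand) - sum(base)) / n,
        # Surveys the candidate fixed, and surveys it broke.
        "gained": sum(1 for b, c in zip(base, cand) if c and not b),
        "lost": sum(1 for b, c in zip(base, cand) if b and not c),
    }


# Made-up ratings for illustration only (not data from the study).
actual = [10, 8, 3, 7, 5, 9, 2, 6]
baseline = [9, 8, 6, 9, 5, 9, 4, 6]
candidate = [10, 6, 4, 8, 5, 9, 4, 7]

print(compare_configs(actual, baseline, candidate))
```

A small net gain can hide a larger churn: if `gained` and `lost` are both sizable, the swap is trading errors rather than removing them, which is worth knowing before rolling out a new model or prompt.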
Topics
- LLM Performance Limits
- Prompt Engineering
- Model Selection
- Experience Rating Prediction
- Open-ended Survey Analysis
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.