Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
Summary
The article proposes a new evaluation framework for large language models (LLMs) that focuses on their ability to generate multiple responses to a single query that vary along an interpretable axis of language complexity. A formative study with 16 participants validated the utility of interactive complexity and identified jargon, information, and length as key factors. The evaluation tested GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 on 98 scientific queries, with each model generating 5 responses per query at different complexity levels. The findings indicate that although models vary complexity across responses, most changes are inconsistent: Claude Sonnet 4.5, the best performer, reliably shifted complexity measures in the correct direction only 46% of the time for jargon and 33% of the time for information. Models struggled to differentiate responses at higher complexity levels, and neither increasing the sample size nor widening the audience labels significantly reduced these inconsistencies.
Key takeaway
AI scientists and ML engineers developing LLM-powered interfaces should prioritize robust control over response complexity beyond simple length adjustments. Current models manage jargon and information density inconsistently across complexity levels, even with varied audience prompts. Focus on fine-tuning models to scale these attributes reliably, so that direct manipulation interfaces can meet user expectations for nuanced complexity changes.
Key insights
- LLMs inconsistently adjust response complexity, failing to reliably differentiate jargon and information across user-controlled levels.
Principles
- LLM evaluations must incorporate interface-specific criteria beyond static chat.
- Direct manipulation of response complexity is valuable for users.
- Jargon, information, and length are key dimensions of perceived complexity.
Method
A framework evaluating LLMs' ability to generate 5 responses to a query, differing along an interpretable language complexity axis, using measures for jargon, information, and length.
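One way to make the "correct direction" criterion concrete is a monotonicity check over the five responses to each query. The sketch below is only an illustration, not the article's implementation: the length measure (a whitespace word count), the function names, and the strict-increase criterion are all assumptions. The article's jargon and information measures are not specified in this summary; any function that maps a response to a number could be passed in as `measure`.

```python
from typing import Callable, Sequence


def word_count(text: str) -> float:
    """Assumed length measure: number of whitespace-separated tokens."""
    return float(len(text.split()))


def shifts_correctly(scores: Sequence[float]) -> bool:
    """True if the score strictly increases between every pair of adjacent levels."""
    return all(lower < higher for lower, higher in zip(scores, scores[1:]))


def consistency_rate(
    response_sets: Sequence[Sequence[str]],
    measure: Callable[[str], float],
) -> float:
    """Fraction of queries whose responses (ordered simplest to most complex)
    increase strictly on the given measure. Hypothetical criterion."""
    if not response_sets:
        raise ValueError("need at least one query")
    hits = sum(
        shifts_correctly([measure(response) for response in responses])
        for responses in response_sets
    )
    return hits / len(response_sets)


# Toy data: two queries, five responses each, ordered by intended complexity.
example = [
    ["a b", "a b c", "a b c d", "a b c d e", "a b c d e f"],  # strictly longer
    ["a b", "a b c", "a b c", "a b c d", "a b c d e"],        # tie at levels 2 and 3
]
print(consistency_rate(example, word_count))  # 0.5
```

A strict inequality treats two adjacent levels with identical scores as a failure, which matches the intuition that moving a slider should produce a perceptible change; a looser criterion would replace `<` with `<=`.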
In practice
- Implement direct manipulation sliders for LLM response complexity.
- Focus on consistent jargon and information scaling, not just length.
- Test LLMs with diverse audience anchors for complexity.
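The slider and audience-anchor suggestions above could be prototyped by mapping each slider position to an audience label in the prompt. This is a minimal sketch under stated assumptions: the five audience labels, the prompt wording, and the function names are hypothetical and are not taken from the article, and no model API is called.

```python
from typing import List, Sequence

# Hypothetical audience anchors, ordered from simplest to most complex.
AUDIENCE_ANCHORS = (
    "a five-year-old child",
    "a middle school student",
    "a high school student",
    "an undergraduate in the field",
    "an expert researcher in the field",
)


def build_prompt(
    query: str, level: int, anchors: Sequence[str] = AUDIENCE_ANCHORS
) -> str:
    """Build the prompt for one slider position (0 = simplest)."""
    if not 0 <= level < len(anchors):
        raise ValueError(f"level must be between 0 and {len(anchors) - 1}")
    return f"Answer the following question for {anchors[level]}.\n\nQuestion: {query}"


def prompts_for_slider(
    query: str, anchors: Sequence[str] = AUDIENCE_ANCHORS
) -> List[str]:
    """One prompt per slider position, so all responses can be generated up front."""
    return [build_prompt(query, level, anchors) for level in range(len(anchors))]


prompts = prompts_for_slider("Why is the sky blue?")
print(prompts[0])
```

Passing a different `anchors` sequence makes it straightforward to compare alternative audience labels on the same set of queries.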
Topics
- LLM Evaluation
- Language Complexity
- Human-Computer Interaction
- Direct Manipulation Interfaces
- Scientific Information Seeking
- GPT-5.1, Claude Sonnet 4.5
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.