LLMs Can Better Capture Human Judgments--With the Right Prompts
Summary
A study published on 2026-06-10 shows that Large Language Models (LLMs) capture human judgments more accurately when given specific prompting strategies. These strategies mitigate two common problems: failing to capture the full distribution of responses, and judgments that shift with small changes in wording. The researchers applied the techniques to two datasets: 144 moral scenarios rated by a U.S.-representative sample, and 38 moral beliefs from the International Social Survey Programme covering 32 countries. A key finding is that prompting LLMs to report standard deviations and response proportions effectively recovers the full spectrum of human responses. Scenario clarity also matters: alignment between model and human judgments improves significantly when scenarios are clear to human participants, as measured by human confusion ratings, and LLMs can track those confusion levels. However, although LLMs predict human variability well, their estimates of their own error remain poorly calibrated.
Key takeaway
Prompt engineers seeking to align LLMs with human judgments should design prompts that explicitly elicit the full response distribution, including standard deviations and response proportions. Prioritize scenario clarity: scenarios that humans rate as less confusing yield better alignment between model and human judgments. LLMs predict human variability well, but their self-estimated error is poorly calibrated, so do not rely on it as a confidence signal. Rely on robust elicitation techniques to capture nuanced human perspectives.
Key insights
Asking LLMs better questions, through specific prompting strategies, improves their ability to capture both human judgments and the distribution of human responses.
Principles
- LLMs can track human confusion ratings.
- Prompting for distributions recovers the full range of human responses.
- Scenario clarity boosts LLM-human alignment.
Method
Prompt LLMs to report standard deviations and response proportions. Ensure scenario clarity, leveraging human confusion ratings to improve model alignment.
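The elicitation step can be sketched roughly as follows. This is a minimal illustration, not the study's actual prompt: the 5-point scale, the prompt wording, the JSON reply format, and the function names build_distribution_prompt and parse_distribution_reply are all assumptions made for the example. No model is called here; the parser only checks a reply string that some LLM client would return.

```python
import json

# Hypothetical 5-point rating scale; the study's actual scales may differ.
SCALE = [1, 2, 3, 4, 5]


def build_distribution_prompt(scenario: str, scale=SCALE) -> str:
    """Build a prompt asking for a mean, a standard deviation, and proportions."""
    options = ", ".join(str(point) for point in scale)
    return (
        "Estimate how a representative sample of people would rate "
        "the scenario below.\n"
        f"Scenario: {scenario}\n"
        f"Rating scale points: {options}\n"
        "Reply with JSON only, using the keys mean, sd, and proportions. "
        "proportions maps each scale point to the share of respondents "
        "choosing it, and the shares must sum to 1."
    )


def parse_distribution_reply(reply: str, scale=SCALE, tol: float = 0.02) -> dict:
    """Validate a JSON reply and return mean, sd, and normalized proportions."""
    data = json.loads(reply)
    # JSON object keys are strings, so convert them back to scale points.
    props = {int(k): float(v) for k, v in data["proportions"].items()}
    if set(props) != set(scale):
        raise ValueError("reply must give a proportion for every scale point")
    if any(v < 0 for v in props.values()):
        raise ValueError("proportions must be non-negative")
    total = sum(props.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"proportions sum to {total}, expected about 1")
    # Renormalize to absorb small rounding error in the reported shares.
    props = {k: v / total for k, v in props.items()}
    return {"mean": float(data["mean"]), "sd": float(data["sd"]), "proportions": props}
```

Asking for a structured reply makes it easy to reject malformed distributions (missing scale points, shares that do not sum to 1) before comparing them with human data.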
In practice
- Prompt for standard deviations and response proportions.
- Use human confusion ratings to check scenario clarity.
- Elicit full response distributions rather than a single answer.
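One way to check how well an elicited distribution matches human data is sketched below. The metrics are assumptions chosen for illustration (total variation distance, the standard deviation implied by a set of proportions, and a Pearson correlation between confusion ratings and per-scenario error); the summary does not say which measures the study used, and the helper names are hypothetical.

```python
import math


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two distributions over scale points."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def implied_sd(props: dict) -> float:
    """Standard deviation implied by proportions over numeric scale points."""
    mean = sum(k * v for k, v in props.items())
    variance = sum(v * (k - mean) ** 2 for k, v in props.items())
    return math.sqrt(variance)


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation, e.g. between confusion ratings and model error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Invented toy numbers for illustration only, not data from the study.
    human = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.0}
    model = {1: 0.3, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1}
    print("distance:", total_variation(human, model))
    print("human SD:", implied_sd(human), "model SD:", implied_sd(model))
```

Computing total_variation per scenario and then passing those errors to pearson alongside the human confusion ratings gives a quick check of whether less clear scenarios are the ones where the model drifts furthest from human responses.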
Topics
- Large Language Models
- Prompt Engineering
- Human Judgment
- AI Alignment
- Response Distributions
- Moral Reasoning
Best for: AI Engineer, Machine Learning Engineer, Research Scientist, AI Scientist, NLP Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.