LLMs Can Better Capture Human Judgments--With the Right Prompts

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study published on 2026-06-10 shows that Large Language Models (LLMs) capture human judgments more accurately when given specific prompting strategies. These strategies mitigate two common problems: failure to capture the full distribution of responses, and judgments that shift with variations in wording. The researchers applied the techniques to two datasets: 144 U.S.-representative moral scenarios and 38 moral beliefs from the International Social Survey Programme covering 32 countries. A key finding is that prompting LLMs to report standard deviations and response proportions recovers the full spectrum of human responses. Scenario clarity, as indicated by human confusion ratings, also matters: alignment improves significantly when scenarios are clear to human participants, and LLMs can track those confusion levels. However, although LLMs can predict human variability, their estimates of their own error remain poorly calibrated.

Key takeaway

Prompt engineers seeking to align LLMs with human judgments should design prompts that explicitly elicit full response distributions, including standard deviations and response proportions. Prioritize scenario clarity: scenarios that human participants find less confusing go with better model alignment. LLMs predict human variability well, but their self-estimated error is poorly calibrated, so treat it with caution and rely on robust elicitation techniques to capture nuanced human perspectives.

Key insights

Asking LLMs better questions, through specific prompting strategies, improves their ability to capture human judgments and response distributions.

Principles

LLMs can track human confusion ratings.
Prompting for distributions recovers full human responses.
Scenario clarity boosts LLM-human alignment.

Method

Prompt LLMs to report standard deviations and response proportions. Check scenario clarity using human confusion ratings, since clearer scenarios improve model alignment.
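
The elicitation step in the method can be sketched as follows. This is a minimal illustration, not the study's prompt: the 5-point scale, the JSON reply format, and the helper names `build_distribution_prompt` and `parse_distribution` are all assumptions made for the example, and no particular model API is called.

```python
import json
import math

SCALE = [1, 2, 3, 4, 5]  # assumed 5-point scale; the source does not specify one


def build_distribution_prompt(scenario: str) -> str:
    """Ask for the spread of human answers rather than a single rating."""
    return (
        "Consider how a representative sample of people would rate the "
        "following scenario on a scale from 1 to 5.\n\n"
        f"Scenario: {scenario}\n\n"
        "Reply with JSON only, using two keys: 'proportions' (an object "
        "mapping each scale point to the share of people choosing it) and "
        "'sd' (the standard deviation of ratings you expect)."
    )


def parse_distribution(reply: str) -> dict:
    """Parse a JSON reply, renormalize the proportions, derive mean and SD."""
    data = json.loads(reply)
    raw = {int(k): float(v) for k, v in data["proportions"].items()}
    if any(v < 0 for v in raw.values()):
        raise ValueError("proportions must be non-negative")
    total = sum(raw.get(point, 0.0) for point in SCALE)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    # Renormalize so small rounding errors in the reply do not propagate.
    props = {point: raw.get(point, 0.0) / total for point in SCALE}
    mean = sum(point * share for point, share in props.items())
    variance = sum(share * (point - mean) ** 2 for point, share in props.items())
    return {
        "proportions": props,
        "mean": mean,
        "sd_from_proportions": math.sqrt(variance),  # implied by the proportions
        "sd_reported": float(data["sd"]),            # stated directly by the model
    }
```

Keeping both `sd_from_proportions` and `sd_reported` makes it possible to check whether the spread the model states agrees with the spread implied by its own proportions before comparing either to human data.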

In practice

Prompt for standard deviations.
Use human confusion ratings to check scenario clarity.
Elicit full response distributions.
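
One way to put these points together is to compare elicited and human response distributions separately for clear and confusing scenarios. The sketch below assumes total variation distance as the comparison metric and a hypothetical confusion cutoff; neither comes from the study, and the sample values are placeholders, not reported data.

```python
from statistics import mean


def total_variation(p: list[float], q: list[float]) -> float:
    """Total variation distance between two discrete distributions (0 = identical)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same response options")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def alignment_by_clarity(items: list[dict], confusion_cutoff: float) -> dict:
    """Average model-human distance for clear vs. confusing scenarios."""
    clear, confusing = [], []
    for item in items:
        distance = total_variation(item["human"], item["model"])
        bucket = clear if item["confusion"] <= confusion_cutoff else confusing
        bucket.append(distance)
    return {
        "clear": mean(clear) if clear else None,
        "confusing": mean(confusing) if confusing else None,
    }


# Placeholder values for illustration only; not data from the study.
items = [
    {"confusion": 1.2, "human": [0.1, 0.6, 0.3], "model": [0.2, 0.5, 0.3]},
    {"confusion": 3.8, "human": [0.3, 0.4, 0.3], "model": [0.7, 0.2, 0.1]},
]
summary = alignment_by_clarity(items, confusion_cutoff=2.5)
print(summary)
```

A lower average distance in the `clear` bucket than in the `confusing` bucket would be consistent with the clarity finding summarized above; the cutoff should be chosen from the rating scale actually used.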

Topics

Large Language Models
Prompt Engineering
Human Judgment
AI Alignment
Response Distributions
Moral Reasoning

Best for: AI Engineer, Machine Learning Engineer, Research Scientist, AI Scientist, NLP Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.