LLMs Can Better Capture Human Judgments--With the Right Prompts

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The paper "LLMs Can Better Capture Human Judgments--With the Right Prompts" by Danica Dillion et al. investigates how to improve the ability of large language models (LLMs) to align with human judgments. It addresses two common limitations: LLMs fail to capture the full distribution of human responses, and their outputs are unstable across wording variations. The authors show that simple prompting strategies mitigate both issues on two datasets: 144 U.S.-representative moral scenarios and 38 moral beliefs from the International Social Survey Programme, spanning 32 countries. Prompting models to report standard deviations and response proportions significantly improves recovery of the range of human responses. Scenario clarity, as indicated by human confusion ratings, also improves alignment, and LLMs can track these confusion levels. Although LLMs predict human variability well, their self-estimated error remains poorly calibrated.

Key takeaway

Prompt engineers and AI scientists building LLM applications that need to capture human judgments accurately should prioritize prompt designs that explicitly ask for response distributions, such as standard deviations and proportions, rather than single answers. This significantly improves the model's ability to reflect the full range of human responses. Scenarios should also be checked for clarity to human readers: because LLMs can track human confusion ratings, they can help flag unclear scenarios to revise before use.
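As a rough illustration of asking for a distribution rather than a single answer, the sketch below builds a prompt that requests a mean, a standard deviation, and per-option proportions, then parses a JSON reply. The prompt wording, the function names build_distribution_prompt and parse_distribution, the JSON keys, and the 1 to 5 scale are all assumptions made for this example; they are not the prompts used in the paper, and no model API is called.

```python
import json


def build_distribution_prompt(scenario: str, scale_min: int = 1, scale_max: int = 5) -> str:
    """Build a prompt asking for a response distribution (hypothetical wording)."""
    points = list(range(scale_min, scale_max + 1))
    return (
        f"Scenario: {scenario}\n"
        "Imagine a representative sample of people rating this scenario "
        f"on a scale from {scale_min} to {scale_max}.\n"
        "Reply with JSON only, using these keys:\n"
        '  "mean": the average rating,\n'
        '  "sd": the standard deviation of the ratings,\n'
        f'  "proportions": an object mapping each of {points} '
        "to the share of people choosing it."
    )


def parse_distribution(reply: str) -> dict:
    """Parse the JSON reply and renormalize proportions so they sum to 1."""
    data = json.loads(reply)
    raw = {int(k): float(v) for k, v in data["proportions"].items()}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("proportions must sum to a positive number")
    proportions = {k: v / total for k, v in sorted(raw.items())}
    return {"mean": float(data["mean"]), "sd": float(data["sd"]), "proportions": proportions}


# Example with a hand-written reply standing in for a model response.
example_reply = '{"mean": 2.0, "sd": 1.0, "proportions": {"1": 2, "2": 1, "3": 1}}'
parsed = parse_distribution(example_reply)
```

Renormalizing the proportions is a design choice for this sketch: models may return shares that do not sum exactly to one, and rescaling keeps downstream comparisons with human response shares well defined.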

Key insights

Simple prompting strategies, including eliciting statistical distributions and ensuring scenario clarity, significantly improve LLM alignment with human judgment.

Principles

Method

Improve LLM-human alignment by prompting models to report standard deviations and response proportions. Also, ensure scenario clarity, guided by human confusion ratings, which LLMs can track.

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, Prompt Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.