Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning; Marketing, Branding & Advertising) · Depth: Expert, extended

Summary

Researchers have introduced ConsumerSimBench, a benchmark that evaluates how accurately Large Language Models (LLMs) can reconstruct real consumer reaction patterns in public discourse. It is built from 1,553 Chinese social-media topics and 23,122 atomic, rule-audited criteria across four reaction families: sentiment flashpoints, emotion keywords, positive aspects, and negative aspects. Rather than relying on holistic preference scoring, the benchmark uses auditable yes-no decisions on concrete reaction points, and these decisions reach 92.1% three-judge agreement and 98.4% agreement with human-majority labels.

Across 13 frontier LLMs, the strongest model, Gemini-3.1-Pro, covered only 47.8% of the real reaction criteria. Other models, such as GPT-5.2 and Claude-4.6, performed significantly worse, which points to a substantial gap between performance on technical benchmarks and socially grounded consumer intuition. The research indicates that LLMs struggle to anticipate specific social triggers and criticism vectors, even when they generate fluent, emotionally varied comments.

Key takeaway

AI Product Managers developing LLM-based consumer simulation tools should recognize that current frontier models are not yet reliable for forecasting specific reactions in public discourse. Your systems may generate fluent, emotionally plausible comments but still miss the critical sentiment flashpoints and negative aspects that drive real-world engagement or crises. Prioritize iterative refinement pipelines, and benchmark against concrete, auditable reaction criteria rather than holistic preference scores, so that your models anticipate what consumers will actually care about.

Key insights

LLMs struggle to reconstruct specific, socially grounded consumer reactions in public discourse, despite strong performance on technical benchmarks.

Principles

Method

ConsumerSimBench uses 1,553 real social-media topics and 23,122 atomic criteria across four reaction families. For each topic, an LLM generates comments; an LLM judge then makes a binary yes-no decision for each criterion, indicating whether the generated comments cover that reaction point.
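The scoring step described above can be illustrated with a minimal sketch. It assumes that coverage is the share of criteria that receive a "yes" decision, pooled across all topics; the paper may aggregate differently (for example, averaging per topic first). The `Judgment` record, the function names, and the example criteria are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One binary judge decision (hypothetical record layout)."""
    topic_id: str
    family: str      # e.g. "negative aspects", "emotion keywords"
    criterion: str   # the atomic reaction point being checked
    covered: bool    # True if the judge answered "yes"


def overall_coverage(judgments):
    """Share of all criteria judged as covered (pooled over topics)."""
    judgments = list(judgments)
    if not judgments:
        return 0.0
    return sum(j.covered for j in judgments) / len(judgments)


def coverage_by_family(judgments):
    """Coverage computed separately for each reaction family."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for j in judgments:
        totals[j.family] += 1
        hits[j.family] += int(j.covered)
    return {family: hits[family] / totals[family] for family in totals}


# Invented example decisions for a single topic (illustrative only).
example = [
    Judgment("t1", "negative aspects", "complains about price", True),
    Judgment("t1", "negative aspects", "doubts durability", False),
    Judgment("t1", "emotion keywords", "mentions disappointment", True),
    Judgment("t1", "sentiment flashpoints", "reacts to the apology", False),
]

if __name__ == "__main__":
    print(f"overall: {overall_coverage(example):.1%}")
    for family, score in coverage_by_family(example).items():
        print(f"{family}: {score:.1%}")
```

Reporting coverage per family as well as overall makes it easier to see whether a model misses flashpoints and negative aspects in particular, which is the failure mode the summary highlights.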

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.