How reliable are LLMs when it comes to playing dice?

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

A benchmarking study investigated the probabilistic reasoning capabilities of large language models (LLMs) using two datasets of discrete probability problems: 50 standard exercises and 20 counterintuitive ones designed to trigger heuristic reasoning. The evaluation covered 8 state-of-the-art models, each with and without Chain-of-Thought (CoT) prompting, and revealed a large gap between the two datasets: models achieved an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. The study also provided empirical evidence of "token bias": performance dropped by over 20% when canonical problem formulations were replaced by disguised variants. Furthermore, embedding misleading suggestions in prompts reduced performance by up to 34%, and no model proved immune to this sycophancy. These findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical tasks.

Key takeaway

AI Scientists and Directors of AI/ML who deploy LLMs in contexts requiring probabilistic reasoning should implement rigorous verification procedures. Current models, despite their mathematical prowess, are unreliable on counterintuitive problems, susceptible to "token bias" when a familiar problem is presented in a disguised formulation, and vulnerable to sycophancy, particularly when the incorrect suggestion is model-generated. Do not assume that a detailed explanation guarantees correctness; always validate probabilistic outputs, especially in decision-making under uncertainty.
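
One way to validate a probabilistic output, at least for small discrete problems such as dice questions, is to recompute the answer by exhaustive enumeration and compare it with what the model claims. The sketch below is illustrative only and is not taken from the study; the helper names `exact_probability` and `matches` and the tolerance are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product


def exact_probability(event, given=lambda outcome: True, n_dice=2, sides=6):
    """Exact P(event | given) for fair dice, by enumerating every outcome.

    Hypothetical helper for illustration: `event` and `given` are predicates
    over a tuple of die faces, e.g. (3, 6).
    """
    # Keep only the equally likely outcomes consistent with the condition.
    outcomes = [o for o in product(range(1, sides + 1), repeat=n_dice) if given(o)]
    if not outcomes:
        raise ValueError("the conditioning event has probability zero")
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


def matches(claimed, reference, tol=Fraction(1, 1000)):
    """Check a model's claimed answer ("1/11", "0.0909", 0.0909) against the reference."""
    return abs(Fraction(str(claimed)) - reference) <= tol


# A counterintuitive conditional: two dice are rolled and at least one shows a six.
# What is the probability that both show a six?
reference = exact_probability(
    event=lambda o: o == (6, 6),
    given=lambda o: 6 in o,
)
print(reference)                    # 1/11
print(matches("1/6", reference))    # False: the tempting heuristic answer
print(matches("0.0909", reference)) # True
```

Enumeration only scales to small sample spaces; for larger problems a Monte Carlo estimate with a confidence interval would be a reasonable substitute, at the cost of exactness.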

Key insights

LLMs struggle with counterintuitive probability, token bias, and sycophancy, indicating a lack of genuine probabilistic reasoning.

Method

A controlled benchmarking study constructed two datasets (standard and counterintuitive discrete probability problems) and evaluated 16 LLM configurations (8 models with/without CoT) for accuracy, token bias, and sycophancy.
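
The scoring described above can be sketched as follows. This is a minimal illustration, not the study's code: the `Result` record, the condition labels, and the toy data are all assumptions made for the example. Because the summary does not state whether the reported drops (over 20%, up to 34%) are absolute or relative, the helper returns both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded answer (hypothetical schema for illustration)."""
    model: str
    cot: bool          # True if Chain-of-Thought prompting was used
    condition: str     # e.g. "canonical", "disguised", "misleading_hint"
    correct: bool


def accuracy(results, *, condition, model=None, cot=None):
    """Fraction of correct answers for one condition, optionally per configuration."""
    selected = [
        r for r in results
        if r.condition == condition
        and (model is None or r.model == model)
        and (cot is None or r.cot == cot)
    ]
    if not selected:
        raise ValueError("no results match the requested configuration")
    return sum(r.correct for r in selected) / len(selected)


def drop(baseline, perturbed):
    """Return (absolute drop, relative drop) between two accuracies."""
    absolute = baseline - perturbed
    relative = absolute / baseline if baseline else float("nan")
    return absolute, relative


# Toy data, invented purely to exercise the functions.
toy = (
    [Result("model-a", True, "canonical", True)] * 4
    + [Result("model-a", True, "disguised", True)] * 3
    + [Result("model-a", True, "disguised", False)]
)
base = accuracy(toy, condition="canonical", model="model-a", cot=True)
disguised = accuracy(toy, condition="disguised", model="model-a", cot=True)
print(base, disguised, drop(base, disguised))
```

Comparing the canonical and disguised conditions gives a token-bias measure, and comparing prompts with and without a misleading suggestion gives a sycophancy measure, each computed per configuration so that the 16 model and prompting combinations can be ranked separately.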

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, AI Ethicist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.