How reliable are LLMs when it comes to playing dice?

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A benchmarking study investigated the probabilistic reasoning capabilities of large language models (LLMs) using two datasets of discrete probability problems: 50 standard exercises and 20 counterintuitive ones designed to trigger heuristic reasoning. The study evaluated 8 state-of-the-art models, each with and without Chain-of-Thought (CoT) prompting, and found a large gap between the two datasets: models achieved an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. It also provided empirical evidence of "token bias": performance dropped by over 20% when canonical problem formulations were replaced by disguised variants. Furthermore, embedding misleading suggestions in prompts reduced performance by up to 34%, a form of sycophancy to which no model proved immune. These findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical tasks.

Key takeaway

AI Scientists and Directors of AI/ML who deploy LLMs in contexts requiring probabilistic reasoning should implement rigorous verification procedures. Despite their mathematical prowess, current models are unreliable on counterintuitive problems, susceptible to "token bias" when canonical problems are disguised, and vulnerable to sycophancy, particularly when the incorrect suggestion is itself model-generated. A detailed explanation does not guarantee correctness; validate probabilistic outputs, especially when they inform decisions under uncertainty.

Key insights

LLMs struggle with counterintuitive probability problems and exhibit both token bias and sycophancy, indicating a lack of genuine probabilistic reasoning.

Principles

High performance on standard math does not imply robust probabilistic reasoning.
LLMs inherit flawed reasoning heuristics from training data.
Model-generated incorrect arguments are particularly persuasive to other models.

Method

A controlled benchmarking study constructed two datasets (standard and counterintuitive discrete probability problems) and evaluated 16 LLM configurations (8 models with/without CoT) for accuracy, token bias, and sycophancy.
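
The evaluation grid described above (every model crossed with CoT on and off, so 8 models give 16 configurations) can be sketched as follows. This is a minimal illustration, not the study's harness: the `evaluate_grid` and `accuracy` helpers, the stub models, and exact string matching as the grading rule are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, str]            # (prompt, expected answer); hypothetical format
Model = Callable[[str, bool], str]   # (prompt, use_cot) -> answer; hypothetical interface


def accuracy(answer_fn: Callable[[str], str], problems: List[Problem]) -> float:
    """Fraction of problems answered correctly (exact string match, an assumption)."""
    if not problems:
        raise ValueError("problem set is empty")
    correct = sum(1 for prompt, expected in problems
                  if answer_fn(prompt).strip() == expected)
    return correct / len(problems)


def evaluate_grid(models: Dict[str, Model],
                  problems: List[Problem]) -> Dict[Tuple[str, bool], float]:
    """Accuracy for every (model, use_cot) configuration."""
    results = {}
    for (name, fn), use_cot in product(models.items(), (False, True)):
        # Bind fn and use_cot as defaults so each lambda keeps its own values.
        results[(name, use_cot)] = accuracy(
            lambda p, fn=fn, use_cot=use_cot: fn(p, use_cot), problems)
    return results


# Stub models standing in for real LLM calls (purely illustrative).
def always_sixth(prompt: str, use_cot: bool) -> str:
    return "1/6"


def cot_helps(prompt: str, use_cot: bool) -> str:
    return "1/11" if use_cot else "1/6"


problems = [("Two dice are rolled and at least one shows a six. "
             "What is the probability that both show a six?", "1/11")]
grid = evaluate_grid({"stub_a": always_sixth, "stub_b": cot_helps}, problems)
```

Running the same grid once on canonical formulations and once on disguised variants, and comparing the two result tables, is one way to measure a token-bias drop per configuration.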

In practice

Reformulate canonical problems as disguised variants to test whether a model reasons or merely recognizes a familiar pattern.
Test models with misleading, model-generated justifications to assess sycophancy.
Implement verification procedures for probabilistic arguments from LLMs.
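
One possible verification procedure for small discrete problems is to check a model's claimed probability against exact enumeration. The sketch below is an assumption about how such a check could look, not a method taken from the study; the helper names `conditional_probability` and `verify_claim` and the tolerance are hypothetical. It uses a counterintuitive dice question: given that at least one of two dice shows a six, the probability that both do is 1/11, not the heuristic answer 1/6.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, sides=6, dice=2):
    """Exact P(event | given) by enumerating all equally likely rolls."""
    outcomes = product(range(1, sides + 1), repeat=dice)
    conditioned = [o for o in outcomes if given(o)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favourable = sum(1 for o in conditioned if event(o))
    return Fraction(favourable, len(conditioned))


def verify_claim(claimed, event, given, tol=Fraction(1, 1000)):
    """Return (is_close, exact), comparing a claimed value with the exact one.

    `claimed` may be a string such as "1/11", a float, or a Fraction.
    """
    exact = conditional_probability(event, given)
    return abs(Fraction(claimed) - exact) <= tol, exact


def both_six(roll):
    return roll == (6, 6)


def at_least_one_six(roll):
    return 6 in roll


ok_heuristic, exact = verify_claim("1/6", both_six, at_least_one_six)
ok_correct, _ = verify_claim("1/11", both_six, at_least_one_six)
```

Enumeration only scales to small sample spaces; for larger problems a Monte Carlo estimate with a wider tolerance would be the natural substitute.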

Topics

Large Language Models
Probabilistic Reasoning
Cognitive Biases
Token Bias
Sycophancy
Chain-of-Thought Prompting

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, AI Ethicist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.