How reliable are LLMs when it comes to playing dice?

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A benchmarking study investigated the probabilistic reasoning capabilities of large language models using two datasets of discrete probability problems: standard exercises and counterintuitive ones designed to trigger heuristic reasoning. Eight "state-of-the-art" models were evaluated, both with and without Chain-of-Thought prompting. The models achieved an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive problems. The study also found evidence of token bias: performance dropped by over 20% when canonical problem formulations were replaced by disguised variants. Embedding misleading suggestions in prompts reduced performance by up to 34%, and no model proved immune. These findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.

Key takeaway

Machine learning engineers deploying LLMs in applications that require robust probabilistic reasoning should critically evaluate model outputs. Current models, despite their mathematical prowess, are significantly vulnerable to problem framing and to misleading prompt elements. Test your LLM's performance on diverse problem formulations, including counterintuitive scenarios and disguised variants, to mitigate the risks associated with token bias and heuristic reasoning.

Key insights

Current LLMs struggle with probabilistic reasoning, especially on counterintuitive problems, on disguised variants of canonical problems, and when prompts contain misleading suggestions.

Principles

LLM probabilistic reasoning is not robust.
Heuristic reasoning affects LLM performance.
Prompt formulation significantly impacts accuracy.

Method

A controlled benchmarking study used standard and counterintuitive discrete probability problems to evaluate 8 LLMs with and without Chain-of-Thought prompting.
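
The comparison described above can be sketched as a small scoring harness. This is only an illustration under stated assumptions, not the study's code: the two prompt templates, the `ask_model` callable, and the answer format (a fraction as the last token of the reply) are all hypothetical.

```python
from fractions import Fraction
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical prompt templates; the study's actual prompts are not reproduced here.
DIRECT_TEMPLATE = "{question}\nReply with the final answer only, as a fraction."
COT_TEMPLATE = (
    "{question}\nLet's think step by step, "
    "then end with the final answer as a fraction."
)


def parse_fraction(reply: str) -> Optional[Fraction]:
    """Read the last whitespace-separated token of a reply as an exact fraction."""
    tokens = reply.strip().split()
    if not tokens:
        return None
    try:
        # Strip trailing punctuation such as the period in "1/6."
        return Fraction(tokens[-1].rstrip(".,;:!"))
    except (ValueError, ZeroDivisionError):
        return None


def accuracy(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    problems: Iterable[Tuple[str, Fraction]],
    template: str,
) -> float:
    """Fraction of problems whose parsed answer equals the expected value exactly."""
    items = list(problems)
    if not items:
        raise ValueError("problems must not be empty")
    correct = 0
    for question, expected in items:
        reply = ask_model(template.format(question=question))
        if parse_fraction(reply) == expected:
            correct += 1
    return correct / len(items)
```

Running `accuracy` once with `DIRECT_TEMPLATE` and once with `COT_TEMPLATE` on the same problem set gives the two numbers to compare. Exact comparison with `Fraction` avoids counting a rounded decimal as correct by accident.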

In practice

Test LLMs with disguised problem variants.
Avoid misleading suggestions in prompts.
Scrutinize LLM outputs on probabilistic tasks.
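
One way to act on the first point is to pair each canonical problem with a reworded variant that has the same answer, then check both. The sketch below is an assumption-laden illustration: the Monty Hall problem and both wordings were chosen here as an example and are not taken from the study, and `ask_model` is a hypothetical callable that returns the answer as text such as "2/3". The ground truth comes from exact enumeration rather than from a hand-typed constant.

```python
from fractions import Fraction
from itertools import product
from typing import Callable, Dict


def monty_hall_switch_win() -> Fraction:
    """Exact probability that switching wins, by enumerating every case."""
    win = Fraction(0)
    for car, pick in product(range(3), repeat=2):
        weight = Fraction(1, 9)  # car position and first pick are uniform
        # The host opens a door that is neither the pick nor the car.
        host_options = [d for d in range(3) if d != pick and d != car]
        for host in host_options:
            remaining = next(d for d in range(3) if d not in (pick, host))
            if remaining == car:
                win += weight / len(host_options)
    return win


# Illustrative wordings (not from the study); both have the same answer.
VARIANTS = {
    "canonical": (
        "You pick one of three doors; one hides a car and two hide goats. "
        "The host, who knows where the car is, opens another door showing a goat. "
        "What is the probability of winning the car if you switch?"
    ),
    "disguised": (
        "A courier picks one of three sealed boxes; exactly one holds a key. "
        "A clerk who knows the contents opens one of the other two boxes and "
        "shows it is empty. What is the probability that the box that was "
        "neither picked nor opened holds the key?"
    ),
}


def check_variants(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, e.g. "2/3" out
    variants: Dict[str, str],
    expected: Fraction,
) -> Dict[str, bool]:
    """Return, for each wording, whether the model's answer matches exactly."""
    results = {}
    for label, prompt in variants.items():
        try:
            answer = Fraction(ask_model(prompt).strip())
        except (ValueError, ZeroDivisionError):
            answer = None  # unparsable replies count as wrong
        results[label] = answer == expected
    return results
```

A result such as `{"canonical": True, "disguised": False}` for the same underlying problem is the kind of gap worth investigating before relying on the model for probabilistic tasks.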

Topics

Large Language Models
Probabilistic Reasoning
Chain-of-Thought Prompting
Prompt Engineering
Model Evaluation

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.