UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

2026-04-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

UnpredictaBench is a benchmark for evaluating how well large language models (LLMs) sample from a specified probability distribution, a capability that matters when LLMs drive simulations that must reproduce real-world unpredictability. It contains 448 problems spanning canonical statistical distributions, stochastic programs, and natural-language scenarios. Its metric, $KS@N$, is the non-rejection rate of the Kolmogorov-Smirnov test when $N$ model samples are compared against ground-truth samples, so a higher score means model outputs are more often statistically indistinguishable from the target distribution. Results across open and proprietary models vary widely, and no model exceeded 40% at $KS@100$: Nemotron-3 Super 120B led with 32.64%, followed by GPT-4o (23.90%) and DeepSeek V3.2 (21.73%), while the much smaller Qwen3.5 2B scored 17.67%. Reasoning yields some improvement, but no immediate fix for this fundamental limitation was found.

Key takeaway

ML engineers building LLM-driven simulations should expect current models to struggle with distributional randomness. Evaluate your chosen LLM's stochastic fidelity over many samples with a metric such as $KS@N$, because a single plausible-looking sample says nothing about the distribution it came from. Do not assume instruction-tuned models will perform better; instruction tuning often reduces the output diversity that sampling requires. Consider specialized models or techniques that improve calibrated sampling when the simulation must be accurate.

Key insights

LLMs consistently struggle to faithfully sample from target probability distributions, hindering their utility in stochastic simulations.

Principles

LLM reasoning about distributions does not translate to faithful generation.
Instruction tuning provides minimal benefit, often reducing output diversity.
Model scale does not guarantee distributional fidelity; smaller models can outperform larger ones.

Method

UnpredictaBench draws 100 independent samples from the model and compares them against 10,000 ground-truth samples with the Kolmogorov-Smirnov test. The $KS@N$ metric reports the rate at which the test does not reject the model's samples.
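The comparison described above can be sketched in a few lines of standard-library Python. This is an illustrative reading of the metric, not the benchmark's own code: the function names, the asymptotic p-value approximation, the significance level `alpha=0.05`, and the choice to aggregate one test per problem are all assumptions that the summary does not specify.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in set(xs) | set(ys):
        # Both ECDFs are step functions, so the supremum is reached at a data point.
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution (an approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the alternating series is unstable here and the true value is close to 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_at_n(problems, n=100, alpha=0.05):
    """Fraction of problems whose first n model samples are NOT rejected.

    `problems` is a list of (model_samples, reference_samples) pairs.
    alpha=0.05 is an assumed significance level.
    """
    passed = 0
    for model_samples, reference_samples in problems:
        drawn = model_samples[:n]
        d = ks_statistic(drawn, reference_samples)
        if ks_pvalue(d, len(drawn), len(reference_samples)) >= alpha:
            passed += 1
    return passed / len(problems)


# Synthetic stand-ins for model output: one faithful sampler, one collapsed sampler.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
faithful = [rng.gauss(0.0, 1.0) for _ in range(100)]
collapsed = [0.0] * 100  # every value is individually plausible, but there is no spread
score = ks_at_n([(faithful, reference), (collapsed, reference)])
print(f"KS@100 = {score:.2%}")
```

The collapsed sampler shows why single-sample plausibility is misleading: each of its outputs is a reasonable draw from a standard normal, yet the batch as a whole is rejected. In a real evaluation, a vetted routine such as SciPy's two-sample KS test would likely be preferable to this hand-rolled approximation.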

In practice

Use $KS@N$ to assess LLM distributional accuracy for simulation tasks.
Prefer the strongest measured models, such as Nemotron-3 Super 120B (32.64% at $KS@100$), for stochastic generation, keeping in mind that even the leader stays below 40%.
Be wary of instruction-tuned models for tasks requiring high output diversity.
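Before running a full statistical test, a quick diversity check can flag the kind of output collapse the last point warns about. The sketch below is a hypothetical helper, not part of UnpredictaBench, and the two sample batches are made-up illustrations rather than measured model outputs.

```python
from collections import Counter


def diversity_report(samples):
    """Summarize how concentrated a batch of sampled outputs is."""
    counts = Counter(samples)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "n": len(samples),
        "distinct": len(counts),
        "distinct_ratio": len(counts) / len(samples),
        "top_value": top_value,
        "top_share": top_count / len(samples),
    }


# Made-up outputs for the prompt "roll a fair six-sided die", 12 draws each.
collapsed = [4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 4, 3]  # concentrated on two faces
spread = [1, 5, 2, 6, 3, 4, 2, 1, 6, 5, 3, 4]     # every face appears

for name, batch in [("collapsed", collapsed), ("spread", spread)]:
    report = diversity_report(batch)
    print(name, report["distinct"], f"{report['top_share']:.2f}")
```

A high `top_share` or a low `distinct` count relative to the target's support suggests reduced diversity, but a spread-out batch can still follow the wrong distribution, so this check complements $KS@N$ and does not replace it.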

Topics

Large Language Models
LLM Evaluation
Distributional Randomness
Stochastic Simulation
Kolmogorov-Smirnov Test
UnpredictaBench
Nemotron-3 Super 120B

Code references

UnpredictaBench/UnpredictaBenchCode

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.