UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

UnpredictaBench is a benchmark for evaluating how well large language models (LLMs) sample from a true underlying probability distribution, a capability that matters when LLMs drive simulations that must reproduce real-world unpredictability. The benchmark comprises 448 problems spanning canonical statistical distributions, stochastic programs, and natural-language scenarios. Its metric, $KS@N$, quantifies how well model outputs approximate a target distribution: it is the non-rejection rate under the Kolmogorov-Smirnov test when $N$ model samples are compared against ground-truth samples. Results across open and proprietary models varied widely, and no model exceeded 40% at $KS@100$. Nemotron-3 Super 120B led with 32.64%, followed by GPT-4o (23.90%) and DeepSeek V3.2 (21.73%). Notably, Qwen3.5 2B, a smaller model, scored 17.67%. Reasoning offers some improvement, but the authors found no immediate solution to this fundamental challenge.

Key takeaway

ML engineers building LLM-driven simulations should recognize that current models struggle significantly with distributional randomness. Evaluate your chosen LLM's stochastic fidelity rigorously with a distribution-level metric such as $KS@N$, because single-sample plausibility is misleading: every individual output can look reasonable while the set of outputs fails to match the target distribution. Do not assume instruction-tuned models will perform better; they often reduce the output diversity that sampling requires. Consider specialized models or techniques that improve calibrated sampling when accurate system simulation matters.
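To make the point about single-sample plausibility concrete, here is a minimal Python sketch using SciPy. It is an illustration only, not taken from the paper: the "narrow" sampler is a hypothetical stand-in for a model with reduced output diversity, the standard-normal target is an assumed example, and the central-95% check is just one possible notion of single-sample plausibility.

```python
import numpy as np
from scipy.stats import kstest, norm

# Hypothetical stand-in for an LLM with reduced output diversity: 100 values that
# follow N(0, 0.1) although the target is N(0, 1). Evenly spaced quantiles are
# used instead of random draws so the example is deterministic.
quantiles = (np.arange(100) + 0.5) / 100
narrow_samples = norm.ppf(quantiles, loc=0.0, scale=0.1)

# Single-sample check: does each value lie in the central 95% of the target?
lower, upper = norm.ppf([0.025, 0.975])
plausible_fraction = np.mean((narrow_samples >= lower) & (narrow_samples <= upper))

# Distribution-level check: one-sample KS test against the target CDF.
ks_result = kstest(narrow_samples, norm.cdf)

print(f"individually plausible: {plausible_fraction:.0%}")
print(f"KS statistic: {ks_result.statistic:.3f}, p-value: {ks_result.pvalue:.2e}")
```

Every value passes the per-sample check, yet the KS test rejects the set as a whole, which is the failure mode a distribution-level metric is meant to expose.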

Key insights

LLMs consistently struggle to faithfully sample from target probability distributions, hindering their utility in stochastic simulations.

Method

UnpredictaBench evaluates LLMs by comparing 100 independent model samples against 10,000 ground-truth samples. The score is reported with the $KS@N$ metric (here $N = 100$), the non-rejection rate under the Kolmogorov-Smirnov test.
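The procedure above can be sketched in Python with SciPy's two-sample KS test. This is a minimal sketch under stated assumptions, not the benchmark's own code: the function name `ks_at_n`, the significance level `alpha=0.05`, and the toy samples are all introduced here for illustration, and the sketch only handles numeric outputs.

```python
import numpy as np
from scipy.stats import ks_2samp, norm


def ks_at_n(problems, n=100, alpha=0.05):
    """Share of problems whose first n model samples pass a two-sample KS test.

    `problems` is a list of (model_samples, reference_samples) pairs.
    `alpha` is an assumed significance level, not a value taken from the paper.
    """
    passed = 0
    for model_samples, reference_samples in problems:
        result = ks_2samp(np.asarray(model_samples)[:n], reference_samples)
        # Non-rejection: the test finds no significant difference at level alpha.
        if result.pvalue >= alpha:
            passed += 1
    return passed / len(problems)


# Toy data for a standard-normal target (illustrative only).
rng = np.random.default_rng(0)
reference = rng.normal(size=10_000)                    # ground-truth samples
calibrated = norm.ppf((np.arange(100) + 0.5) / 100)    # evenly spaced normal quantiles
collapsed = np.zeros(100)                              # always answers with the mean

score = ks_at_n([(calibrated, reference), (collapsed, reference)])
print(f"KS@100 = {score:.2%}")
```

In this toy setup the evenly spaced quantiles stand in for a well-calibrated model and the constant sampler for one that always returns the mean; a real evaluation would replace both with parsed model outputs for each benchmark problem.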

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.