UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
Summary
UnpredictaBench is a benchmark for evaluating whether large language models (LLMs) can sample faithfully from an underlying probability distribution, a capability that matters when LLMs drive simulations that need real-world unpredictability. The benchmark includes 448 problems spanning canonical statistical distributions, stochastic programs, and natural-language scenarios. It introduces a metric, $KS@N$, defined as the rate at which model samples of size N are not rejected by the Kolmogorov-Smirnov test when compared against ground-truth samples. Tests across open and proprietary models showed a wide spread in capability, and no model exceeded 40% at $KS@100$: Nemotron-3 Super 120B led with 32.64%, followed by GPT-4o (23.90%) and DeepSeek V3.2 (21.73%). Notably, Qwen3.5 2B, a smaller model, scored 17.67%. Reasoning offers some improvement, but no immediate solution to this fundamental challenge was found.
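One way to write the metric down, as a sketch rather than the paper's exact definition: the two-sample Kolmogorov-Smirnov statistic is standard, while the significance level $\alpha$, the problem set symbol $\mathcal{P}$, and the per-problem averaging are assumptions made here for illustration. With $\hat{F}_N$ the empirical CDF of the N model samples and $\hat{G}_M$ the empirical CDF of the M ground-truth samples:

$$D_{N,M} = \sup_x \left| \hat{F}_N(x) - \hat{G}_M(x) \right|$$

$$KS@N = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \mathbb{1}\left[ p_i \ge \alpha \right]$$

Here $p_i$ is the p-value of the test on problem $i$, so $KS@N$ is the fraction of problems on which the test fails to reject the hypothesis that model and ground-truth samples share a distribution.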
Key takeaway
ML engineers building LLM-driven simulations should recognize that current models struggle significantly with distributional randomness. Evaluate your chosen LLM's stochastic fidelity rigorously with a metric such as $KS@N$, because single-sample plausibility is misleading: each output can look reasonable while the outputs together miss the target distribution. Do not assume instruction-tuned models will perform better; instruction tuning often reduces the output diversity that sampling requires. Consider specialized models or techniques that improve calibrated sampling if you need accurate system simulations.
Key insights
LLMs consistently struggle to faithfully sample from target probability distributions, hindering their utility in stochastic simulations.
Principles
- LLM reasoning about distributions does not translate to faithful generation.
- Instruction tuning provides minimal benefit, often reducing output diversity.
- Model scale does not guarantee distributional fidelity; smaller models can outperform larger ones.
Method
UnpredictaBench evaluates LLMs with the $KS@N$ metric: N independent model samples (100 for $KS@100$) are compared against 10,000 ground-truth samples using the Kolmogorov-Smirnov test, and the score is the rate at which the test does not reject the model samples.
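A minimal sketch of how such a non-rejection rate could be computed, using only the Python standard library. This is not the benchmark's implementation: the function names, the significance level `alpha=0.05`, the asymptotic p-value approximation, and the averaging over problems are all assumptions made for illustration.

```python
import bisect
import math
import random


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / n
        fy = bisect.bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue(xs, ys, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    n, m = len(xs), len(ys)
    lam = math.sqrt(n * m / (n + m)) * ks_statistic(xs, ys)
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_at_n(problems, alpha=0.05):
    """Fraction of problems whose model samples are NOT rejected at level alpha.

    `problems` is a list of (model_samples, truth_samples) pairs.
    alpha=0.05 is an assumed threshold, not one taken from the benchmark.
    """
    passed = sum(1 for model, truth in problems
                 if ks_pvalue(model, truth) >= alpha)
    return passed / len(problems)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [rng.gauss(0, 1) for _ in range(10_000)]
    faithful = [rng.gauss(0, 1) for _ in range(100)]
    # Every value here is plausible for N(0, 1), but the spread is far too small.
    collapsed = [rng.gauss(0, 0.1) for _ in range(100)]
    print(ks_at_n([(faithful, truth), (collapsed, truth)]))
```

The `collapsed` sampler in the demo illustrates why single-sample checks fall short: each individual draw is a believable value under the target, yet the set of 100 draws is rejected because its empirical CDF is far from the ground truth. In a real pipeline, a tested library routine such as SciPy's `scipy.stats.ks_2samp` would be a reasonable replacement for the hand-rolled test.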
In practice
- Use $KS@N$ to assess LLM distributional accuracy for simulation tasks.
- Prefer the strongest measured models, such as Nemotron-3 Super 120B, for stochastic generation, keeping in mind that even it reached only 32.64% at $KS@100$.
- Be wary of instruction-tuned models for tasks requiring high output diversity.
Topics
- Large Language Models
- LLM Evaluation
- Distributional Randomness
- Stochastic Simulation
- Kolmogorov-Smirnov Test
- UnpredictaBench
- Nemotron-3 Super 120B
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.