QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

QuantSightBench is a new benchmark that evaluates the quantitative forecasting capabilities of large language models (LLMs) using prediction intervals, moving beyond traditional binary or multiple-choice formats. It assesses numerical estimates over continuous quantities, a critical skill for domains like economics and public health. Models are evaluated in three settings: zero-shot, background-context, and agentic (with retrieval tools). Two properties are measured: empirical coverage and interval sharpness. None of the 11 evaluated frontier and open-weight models, including Gemini 3.1 Pro, Grok 4, and GPT-5.4, achieved the target 90% coverage, and even the top performers fell at least 10 percentage points short. Calibration degrades significantly at extreme magnitudes, indicating systematic overconfidence and scale sensitivity across all models.

Key takeaway

AI engineers and research scientists building or deploying LLMs for quantitative decision support should expect current frontier models to be systematically overconfident in numerical forecasting, particularly at extreme magnitudes. Prediction intervals from these models will likely be too narrow, which risks misinformed decisions. Prioritize calibration and scale awareness in LLM applications, and state the desired confidence level explicitly in prompts to improve interval quality.

Key insights

LLMs consistently exhibit overconfidence in quantitative forecasting, failing to achieve target prediction interval coverage.

Principles

Prediction intervals offer a robust evaluation format for numerical forecasting.
Scale awareness is a critical bottleneck for LLM forecasting performance.
Explicit confidence level instructions improve LLM interval quality.
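
One way to probe the scale-awareness point is to break coverage down by the order of magnitude of the true value. The sketch below is illustrative only: the function name, the record layout, and the choice of base-10 buckets are assumptions, not the benchmark's own analysis code.

```python
import math
from collections import defaultdict

def coverage_by_magnitude(records):
    """Group forecasts by the order of magnitude of the true value.

    records: iterable of (lower, upper, truth) tuples with truth != 0.
    Returns {floor(log10(|truth|)): fraction of truths inside the interval}.
    The base-10 bucketing is an illustrative assumption.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lower, upper, truth in records:
        bucket = math.floor(math.log10(abs(truth)))
        totals[bucket] += 1
        hits[bucket] += int(lower <= truth <= upper)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Hypothetical forecasts: (lower, upper, true value)
example = [
    (5, 20, 12),        # covered, magnitude 1
    (8, 15, 30),        # missed, magnitude 1
    (1e8, 5e8, 3e9),    # missed, magnitude 9
    (1e9, 9e9, 2e9),    # covered, magnitude 9
]
print(coverage_by_magnitude(example))
```

If coverage drops in the highest and lowest buckets relative to the middle ones, that is consistent with the scale sensitivity described above.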

Method

QuantSightBench evaluates LLMs using prediction intervals across zero-shot, background-context, and agentic settings, employing Coverage and Mean Log Interval Score (MLIS) metrics to assess calibration and sharpness.
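
The two metric families can be sketched as follows. Empirical coverage is simply the fraction of true values that land inside their intervals. The summary does not give the exact definition of MLIS, so the second part is an assumption: it applies the standard interval (Winkler) score to log10-transformed bounds and truths, which is one plausible reading of a "log interval score" and works only for strictly positive quantities. All function names here are hypothetical.

```python
import math

def empirical_coverage(intervals, truths):
    """Fraction of true values that fall inside their [lower, upper] interval."""
    hits = sum(1 for (lo, hi), y in zip(intervals, truths) if lo <= y <= hi)
    return hits / len(truths)

def interval_score(lo, hi, y, alpha=0.10):
    """Standard interval score for a central (1 - alpha) interval.

    Lower is better: interval width plus a miss penalty scaled by 2 / alpha.
    """
    score = hi - lo
    if y < lo:
        score += (2.0 / alpha) * (lo - y)
    elif y > hi:
        score += (2.0 / alpha) * (y - hi)
    return score

def mean_log_space_interval_score(intervals, truths, alpha=0.10):
    """ASSUMPTION: interval score computed on log10 values, then averaged.

    This may differ from the benchmark's MLIS; it requires positive inputs.
    """
    scores = [
        interval_score(math.log10(lo), math.log10(hi), math.log10(y), alpha)
        for (lo, hi), y in zip(intervals, truths)
    ]
    return sum(scores) / len(scores)

# Hypothetical 90% intervals and realized values
intervals = [(10, 1000), (1, 10), (100, 200)]
truths = [100, 100, 150]
print(empirical_coverage(intervals, truths))
print(mean_log_space_interval_score(intervals, truths))
```

Scoring in log space makes a miss by a factor of ten cost the same whether the quantity is in the tens or the billions, which is one reason a log-based score suits questions that span many magnitudes.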

In practice

Specify confidence levels in prompts for better LLM interval calibration.
Provide relevant background context to improve forecasting performance.
Consider increased reasoning effort for weaker models to enhance calibration.
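
The first two recommendations can be combined in a prompt template that names the confidence level and optionally prepends background context. This is a minimal sketch under stated assumptions: the prompt wording, the `LOWER:`/`UPPER:` reply format, and both function names are hypothetical and are not taken from the benchmark.

```python
import re

# Matches numbers such as 1200, 1,200, 3.5, or 3.5e3 (illustrative pattern).
_NUM = r"([-+]?\d[\d,]*(?:\.\d+)?(?:[eE][-+]?\d+)?)"

def build_interval_prompt(question, confidence=0.90, context=None):
    """Build a forecasting prompt that states the confidence level explicitly."""
    parts = []
    if context:
        parts.append(f"Background:\n{context}\n")
    parts.append(f"Question: {question}")
    parts.append(
        f"Give a {confidence:.0%} prediction interval: the true value should "
        f"fall inside it {confidence:.0%} of the time."
    )
    parts.append("Answer on two lines as 'LOWER: <number>' and 'UPPER: <number>'.")
    return "\n".join(parts)

def parse_interval(reply):
    """Extract (lower, upper) from a reply in the assumed two-line format."""
    lower = re.search(r"LOWER:\s*" + _NUM, reply)
    upper = re.search(r"UPPER:\s*" + _NUM, reply)
    if not lower or not upper:
        raise ValueError("reply does not contain LOWER and UPPER values")
    lo = float(lower.group(1).replace(",", ""))
    hi = float(upper.group(1).replace(",", ""))
    if lo > hi:
        raise ValueError("lower bound exceeds upper bound")
    return lo, hi

prompt = build_interval_prompt(
    "What will the population of city X be in 2030?",  # hypothetical question
    context="The 2020 census counted 1.1 million residents.",  # hypothetical
)
print(prompt)
print(parse_interval("LOWER: 1,200\nUPPER: 3.5e3"))
```

Rejecting malformed or inverted intervals, rather than silently repairing them, keeps parsing failures visible when coverage is computed later.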

Topics

LLM Quantitative Forecasting
Prediction Intervals
QuantSightBench
Model Calibration
Systematic Overconfidence

Code references

aisa-group/quantsightbench

Best for: AI Engineer, Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.