Impact of modelling assumptions on time horizon results

2026-03-20 · Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation · Depth: Advanced, extended

Summary

Results from METR's time horizon (TH) task suite are increasingly sensitive to analysis choices: a regularization fix on 2026-03-03 lowered recent models' 50% TH by up to 20%. This analysis examines other key sources of uncertainty: the task distribution, the choice of success-rate curve model, private versus public tasks, and noise in task length estimates. For top-performing models such as Opus 4.6, alternative success-rate curve models produced a ~1.5x variation in 50% TH and a 2x variation in 80% TH. Excluding public tasks reduced Opus 4.6's 50% TH by 40%, to 7h 11m. Noise in task length estimates, analyzed with the SIMEX technique, could reduce Opus 4.6's 50% TH by 25-40% under the current logistic model while potentially increasing its 80% TH by 9-23%. The task distribution remains the most significant source of uncertainty.

Key takeaway

AI scientists and machine learning engineers evaluating LLM capabilities should critically assess the modeling assumptions behind reported time horizon metrics. Task distribution, success-rate curve choice, and task length noise can each significantly alter 50% and 80% time horizon estimates. Check the robustness of results by testing alternative models or data subsets, especially for top-performing models near the edge of the task suite, to avoid over-anchoring on single point estimates.

Key insights

LLM time horizon estimates are highly sensitive to modeling choices and data noise, especially for frontier models.

Principles

LLM time horizon estimates are highly sensitive to underlying modeling assumptions.
Task distribution and noise in task length estimates are primary uncertainty sources.
Logistic models are sensitive to LLM performance on very easy tasks.
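
To make the dependence on model parameters concrete, the following is a sketch of how a p% time horizon falls out of a logistic success-rate curve. It assumes success probability is modelled as a logistic function of log2 task length with intercept \beta_0 and slope \beta_1; the symbols are illustrative and not taken from the original article.

```latex
\begin{align*}
P(\text{success} \mid t) &= \sigma(\beta_0 + \beta_1 \log_2 t),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \beta_1 < 0 \\
\log_2 T_p &= \frac{\operatorname{logit}(p) - \beta_0}{\beta_1},
  \qquad \operatorname{logit}(p) = \ln\frac{p}{1 - p} \\
T_{50} &= 2^{-\beta_0 / \beta_1},
  \qquad T_{80} = T_{50} \cdot 2^{\ln 4 / \beta_1}
\end{align*}
```

Under these assumptions, T_80 sits below T_50 because \beta_1 is negative, and the gap widens as the fitted slope flattens. Both horizons depend on an intercept and slope fitted to every task at once, which is one way results on very short tasks can move estimates at the long end.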

Method

The SIMEX (SIMulate and EXtrapolate) technique estimates noise impact by simulating additional noise in datasets, fitting models, and extrapolating trends back to a "noiseless" state.
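
The procedure can be sketched on synthetic data as follows. This is a minimal illustration, not METR's implementation: it assumes additive Gaussian noise with known standard deviation `sigma` on log2 task length, a plain Newton-method logistic fit, and a quadratic extrapolant. All function names and numeric values are hypothetical.

```python
import numpy as np

def fit_logistic(x, y, n_iter=30):
    """Fit P(success) = sigmoid(b0 + b1 * x) by Newton's method; returns [b0, b1]."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)      # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        hess = (X * w[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    return beta

def simex(x_obs, y, sigma, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0), n_reps=20, seed=0):
    """SIMEX: add extra noise of variance lam * sigma^2, refit, extrapolate to lam = -1."""
    rng = np.random.default_rng(seed)
    means = []
    for lam in lambdas:
        fits = [
            fit_logistic(x_obs + rng.normal(0.0, np.sqrt(lam) * sigma, x_obs.shape), y)
            for _ in range(n_reps)
        ]
        means.append(np.mean(fits, axis=0))   # average coefficients over replicates
    means = np.array(means)                   # shape (len(lambdas), 2)
    coefs = np.polyfit(np.array(lambdas), means, deg=2)  # one quadratic per coefficient
    return np.array([np.polyval(coefs[:, j], -1.0) for j in range(2)])

def horizon_minutes(beta, p):
    """Task length (minutes) at which the fitted success probability equals p."""
    b0, b1 = beta
    return 2.0 ** ((np.log(p / (1.0 - p)) - b0) / b1)

# Synthetic example (assumed values): true curve has b0 = 5, b1 = -1 in log2(minutes).
rng = np.random.default_rng(1)
n = 4000
x_true = rng.normal(5.0, 2.0, n)                       # true log2 task lengths
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(5.0 - x_true)))).astype(float)
sigma = 1.0                                            # assumed noise sd, in doublings
x_obs = x_true + rng.normal(0.0, sigma, n)             # noisy length estimates

naive = fit_logistic(x_obs, y)
corrected = simex(x_obs, y, sigma)
for name, beta in [("naive", naive), ("SIMEX", corrected)]:
    print(name, "slope:", round(float(beta[1]), 3),
          "T50 (min):", round(float(horizon_minutes(beta, 0.5)), 1),
          "T80 (min):", round(float(horizon_minutes(beta, 0.8)), 1))
```

In this toy setup the noisy fit has a flatter slope than the true curve, and the SIMEX extrapolation recovers part of that attenuation. The quadratic extrapolant is a common but approximate choice, so the corrected estimate is best read as a sensitivity check rather than an exact noise-free value.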

In practice

Evaluate alternative success-rate curve models for robustness.
Assess the impact of private vs. public task data on model performance.
Consider capping task lengths to mitigate overestimation biases.

Topics

LLM Evaluation
Time Horizon Metric
Statistical Modeling
Uncertainty Quantification
Claude Opus 4.6
SIMEX

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.