Impact of modelling assumptions on time horizon results
Summary
Results on METR's time horizon (TH) task suite are increasingly sensitive to analysis choices: a regularization fix on 2026/03/03 lowered the 50% TH of recent models by up to 20%. This analysis explores other key sources of uncertainty, including the task distribution, the choice of success-rate curve model, private versus public tasks, and noise in task length estimates. For top-performing models such as Opus 4.6, alternative success-rate curve models produced roughly 1.5x variation in the 50% TH and 2x in the 80% TH. Excluding public tasks reduced Opus 4.6's 50% TH by 40%, to 7h 11m. Noise in task length estimates, analyzed with the SIMEX technique, could reduce Opus 4.6's 50% TH by 25-40% under the current logistic model while potentially increasing its 80% TH by 9-23%. The task distribution remains the most significant source of uncertainty.
Key takeaway
AI scientists and machine learning engineers evaluating LLM capabilities should critically assess the modeling assumptions behind reported time horizon metrics. Task distribution, the choice of success-rate curve, and noise in task length estimates can each significantly alter 50% and 80% time horizon estimates. Check the robustness of results by testing alternative models or data subsets, especially for top-performing models near the edge of the task suite, to avoid over-anchoring on single point estimates.
Key insights
LLM time horizon estimates are highly sensitive to modeling choices and data noise, especially for frontier models.
Principles
- LLM time horizon estimates are highly sensitive to underlying modeling assumptions.
- Task distribution and noise in task length estimates are primary uncertainty sources.
- Logistic models are sensitive to LLM performance on very easy tasks.
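As a minimal sketch of how the 50% and 80% horizons relate to a fitted curve, assume success probability is modelled as a logistic function of log2 task length, p(t) = sigmoid(a + b * log2(t)), with b < 0. The base-2 logarithm, the function names, and the coefficient values below are illustrative assumptions, not METR's fitted values. Solving p(t) = target for t gives the horizon at that success rate:

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def time_horizon(intercept: float, slope: float, success_rate: float = 0.5) -> float:
    """Task length at which the fitted curve predicts `success_rate`.

    Assumes p(t) = sigmoid(intercept + slope * log2(t)) with slope < 0,
    so longer tasks are less likely to succeed.
    """
    if slope >= 0:
        raise ValueError("slope must be negative for a finite horizon")
    return 2.0 ** ((logit(success_rate) - intercept) / slope)


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
a, b = 3.0, -0.5
h50 = time_horizon(a, b, 0.5)  # logit(0.5) = 0, so this is 2 ** (-a / b)
h80 = time_horizon(a, b, 0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this parameterisation, log2 of the 80% horizon differs from log2 of the 50% horizon by logit(0.8) / slope, so a change in the fitted slope alone moves the two horizons by different amounts.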
Method
The SIMEX (SIMulate and EXtrapolate) technique estimates the impact of noise by adding increasing amounts of simulated noise to the dataset, refitting the model at each noise level, and extrapolating the resulting trend back to a "noiseless" state.
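The procedure can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not METR's implementation: the logistic-in-log2-length model, the known noise standard deviation `sigma_u`, the three noise levels, the quadratic extrapolant, and the choice to extrapolate the log2 of the 50% horizon directly are all assumptions made here, and every function name is hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method; returns (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # negative Hessian (X^T W X)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def lagrange_eval(xs, ys, x0):
    """Evaluate the polynomial through the points (xs, ys) at x0."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x0 - xj) / (xi - xj)
        total += term
    return total


def simex_log2_horizon(obs_log2_len, success, sigma_u,
                       lambdas=(0.0, 1.0, 2.0), reps=20, seed=0):
    """Return (naive, SIMEX-corrected) log2 of the 50% horizon.

    At level lam, extra noise with variance lam * sigma_u**2 is added to the
    observed log2 lengths, so total noise variance is (1 + lam) * sigma_u**2.
    Extrapolating the trend to lam = -1 approximates the noiseless case.
    """
    rng = random.Random(seed)
    means = []
    for lam in lambdas:
        n_rep = 1 if lam == 0 else reps  # lam = 0 is just the observed data
        vals = []
        for _ in range(n_rep):
            sd = math.sqrt(lam) * sigma_u
            noisy = [x + rng.gauss(0.0, sd) for x in obs_log2_len]
            a, b = fit_logistic(noisy, success)
            vals.append(-a / b)  # log2 length where predicted success is 50%
        means.append(sum(vals) / len(vals))
    # Three noise levels, so the interpolating polynomial is a quadratic.
    return means[0], lagrange_eval(lambdas, means, -1.0)


if __name__ == "__main__":
    rng = random.Random(42)
    sigma_u = 1.0  # assumed noise SD of log2 task length estimates
    true_x = [rng.uniform(0.0, 10.0) for _ in range(600)]
    obs_x = [x + rng.gauss(0.0, sigma_u) for x in true_x]
    y = [1 if rng.random() < sigmoid(4.0 - 0.7 * x) else 0 for x in true_x]
    naive, corrected = simex_log2_horizon(obs_x, y, sigma_u)
    print(f"naive log2 horizon: {naive:.2f}, SIMEX log2 horizon: {corrected:.2f}")
```

The correction is only as good as the assumed noise level and the extrapolant, which is why results of this kind are better read as a range than as a single corrected value.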
In practice
- Evaluate alternative success-rate curve models for robustness.
- Assess the impact of private vs. public task data on model performance.
- Consider capping task lengths to mitigate overestimation biases.
Topics
- LLM Evaluation
- Time Horizon Metric
- Statistical Modeling
- Uncertainty Quantification
- Claude Opus 4.6
- SIMEX
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by METR.