Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Summary
A new paper by Francis Rhys Ward, Dewi Gould, Anders Cairns Woodruff et al. quantifies the "no-CoT" (no chain-of-thought) task-completion time horizons of frontier AI models, investigating how long models can reason internally without externalizing their thought process. Extending Ryan Greenblatt's work, the authors evaluated 14 models, from GPT-2 (2019) to GPT-5.5 (2026), on 43 benchmarks spanning math, coding, knowledge, and agentic tool use. They found that frontier models such as GPT-5.5 reach a 50% success rate on tasks that take humans approximately three minutes, and that this "time horizon" has doubled every 373 days (95% CI: 167–691) since 2019. No-CoT reasoning capability is projected to reach ~7 minutes (~3.7k tokens) by 2028 and ~25 minutes (~12k tokens) by 2030, which raises safety concerns about models' motivations and their potential for "scheming" outside human oversight.
Key takeaway
AI security engineers monitoring model behavior should explicitly track no-CoT time horizons to establish a lower bound on how much reasoning a model can do unobserved. The metric matters because models capable of 25 minutes of latent reasoning by 2030 could enable significant subversion, making traditional CoT monitoring less effective. Integrate cheap no-CoT evaluations into your safety protocols to identify risks from increasingly opaque model capabilities early.
Key insights
Frontier AI models' ability to reason internally without Chain of Thought (CoT) is growing exponentially, posing safety risks.
Principles
- No-CoT reasoning capability doubles approximately every year.
- Post-GPT-4, CoT-enabled gains outpace no-CoT gains.
- Opaque reasoning capability is a critical safety metric.
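The doubling-time principle can be turned into a rough extrapolation. The sketch below assumes steady exponential growth at the 373-day point estimate; the function name and the calendar dates chosen for "2028" and "2030" are illustrative assumptions, and the paper's own projections presumably come from its fitted trend, so the numbers will not match exactly.

```python
from datetime import date

DOUBLING_DAYS = 373  # point estimate quoted in the summary (95% CI: 167-691)


def project_horizon(h0_minutes, start, end, doubling_days=DOUBLING_DAYS):
    """Extrapolate a time horizon assuming steady exponential growth.

    h0_minutes is the horizon at `start`; the result is the horizon at `end`.
    """
    elapsed_days = (end - start).days
    return h0_minutes * 2 ** (elapsed_days / doubling_days)


# Assumption: treat "2028" and "2030" as 1 January of each year.
h_2030 = project_horizon(7.0, date(2028, 1, 1), date(2030, 1, 1))
print(round(h_2030, 1))  # about 27 minutes, the same ballpark as the ~25 quoted
```

Passing doubling_days=167 or doubling_days=691 shows how wide the projection becomes across the reported confidence interval.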
Method
Models are evaluated on 43 benchmarks with CoT output prevented. Each model's 50% success-rate time horizon is estimated by fitting a logistic curve, with task length anchored to human solve time and o3-mini reasoning-token counts.
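A minimal sketch of the logistic-fit step, assuming success probability is modeled as a sigmoid in log2 of human solve time (in minutes). The function names, the Newton's-method fitter, and the synthetic success rates are illustrative assumptions, not the paper's code or data; the o3-mini token anchor is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method.

    ys may be 0/1 outcomes or per-task success rates in [0, 1].
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p          # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w             # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:
            break
        # Newton step: solve the 2x2 system H * delta = g
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon_50(durations_min, success):
    """Task length (minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(t) for t in durations_min]
    a, b = fit_logistic(xs, success)
    return 2 ** (-a / b)  # sigmoid(a + b*x) = 0.5 exactly when x = -a/b


# Synthetic demo data: success rates generated from a known curve whose
# 50% point is 3 minutes, so the fit should recover that value.
durations = [0.25, 0.5, 1, 2, 4, 8, 16]
rates = [sigmoid(1.2 * (math.log2(3) - math.log2(t))) for t in durations]
estimate = time_horizon_50(durations, rates)
print(round(estimate, 3))  # recovers ~3.0 minutes by construction
```

Fitting in log2 of task length makes the horizon directly comparable across models and keeps doubling-time trends linear in that scale.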
In practice
- Track no-CoT Time Horizons as a safety lower bound.
- Use benchmark-specific prompts and structured-output constraints.
Topics
- AI Safety
- Chain of Thought
- Frontier AI Models
- Model Evaluation
- Latent Reasoning
- GPT-5.5
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.