OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

2026-05-14 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

OpenDeepThink is a population-based test-time compute framework that improves Large Language Model (LLM) reasoning by scaling breadth through parallel candidate sampling. Because no ground-truth verifier is available to pick the best candidate, it relies on pairwise comparisons aggregated with the Bradley-Terry model. In each generation, the LLM judges random pairs of candidates, the votes are aggregated into a global ranking, the top-ranked candidates are preserved, the top three quarters are mutated using natural-language critiques, and the bottom quarter is discarded. The method raised Gemini 3.1 Pro's effective Codeforces Elo by 405 points over eight sequential LLM-call rounds, taking approximately 27 minutes. The pipeline transfers across LLMs of different strengths without retuning, and on the multi-domain HLE benchmark its gains are concentrated in objectively verifiable domains. A new dataset, CF-73, comprising 73 expert-rated Codeforces problems with International Grandmaster annotation, is also released.

Key takeaway

AI engineers optimizing LLM reasoning should consider population-based methods such as OpenDeepThink, which scale breadth rather than only depth. Aggregating pairwise LLM judgments with the Bradley-Terry model improves candidate selection and yields significant gains, particularly in objectively verifiable domains, and the pipeline is reported to carry over across models without retuning.

Key insights

Bradley-Terry comparison effectively aggregates noisy LLM judgments for parallel reasoning candidate selection.
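
To make the aggregation step concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise votes. The model assigns each candidate i a positive strength p_i and assumes candidate i beats candidate j with probability p_i / (p_i + p_j). The fit below uses the standard minorization-maximization update. The function name, the `prior` smoothing term, and the toy vote list are assumptions for illustration and are not taken from the original article.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=500, prior=0.1, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    `prior` (an assumption of this sketch) gives every item a fractional win
    and loss against a virtual opponent of fixed strength 1.0, so unbeaten or
    winless items keep finite, positive strengths.
    """
    wins = [prior] * n_items          # win counts, including the virtual win
    games = defaultdict(int)          # games[(i, j)] = number of i-vs-j votes
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(min(winner, loser), max(winner, loser))] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        # Two virtual games (one win, one loss) against the strength-1 opponent.
        denom = [2 * prior / (p[i] + 1.0) for i in range(n_items)]
        for (i, j), n in games.items():
            share = n / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        new_p = [wins[i] / denom[i] for i in range(n_items)]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


# Toy votes: candidate 0 usually beats 1, and 1 usually beats 2,
# with a couple of "noisy" judgments going the other way.
votes = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (0, 2), (0, 2)]
strengths = bradley_terry(3, votes)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking, [round(s, 3) for s in strengths])
```

The point of fitting a single global model, rather than counting raw wins, is that each vote is weighted by the strength of the opponent, so a few inconsistent judgments do not overturn the overall order.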

Principles

Scaling breadth improves LLM reasoning.
Pairwise comparison reduces LLM judging bias.

Method

OpenDeepThink uses LLM pairwise judgments, Bradley-Terry aggregation for ranking, and natural-language critiques to mutate top candidates, discarding lower-ranked ones.
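
The selection and mutation step described above can be sketched as follows, assuming the candidates have already been ranked best-first by the aggregated votes. This is one reading of the summary, not the authors' implementation: `sample_pairs`, `next_generation`, the `mutate` callable (a stand-in for an LLM critique-and-rewrite call), and the `n_elite` parameter are all hypothetical names, and the number of preserved candidates is not stated in the source.

```python
import random


def sample_pairs(n_candidates, n_pairs, rng):
    """Draw random pairs of distinct candidate indices for the judge."""
    return [tuple(rng.sample(range(n_candidates), 2)) for _ in range(n_pairs)]


def next_generation(ranked, mutate, n_elite=1):
    """Build the next population from candidates sorted best-first.

    - the bottom quarter is discarded,
    - the top three quarters are each rewritten by `mutate`,
    - the first `n_elite` candidates are also carried over unchanged
      (n_elite is an assumption; the source does not give a number).
    """
    cutoff = (3 * len(ranked)) // 4
    survivors, discarded = ranked[:cutoff], ranked[cutoff:]
    elites = survivors[:n_elite]
    mutated = [mutate(candidate) for candidate in survivors]
    return elites + mutated, discarded


rng = random.Random(0)
pairs = sample_pairs(n_candidates=8, n_pairs=12, rng=rng)

ranked = [f"c{i}" for i in range(8)]       # pretend c0 is ranked best
population, discarded = next_generation(
    ranked,
    mutate=lambda c: c + "'",              # placeholder for critique + rewrite
    n_elite=2,
)
print(population, discarded)
```

In this toy run, choosing `n_elite` equal to a quarter of the population keeps the population size constant from one generation to the next. That is a property of the sketch's parameters, not a detail confirmed by the source.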

In practice

Apply Bradley-Terry for LLM self-evaluation.
Use natural-language critiques for mutation.
Focus on objectively verifiable domains.

Topics

OpenDeepThink
Bradley-Terry Aggregation
LLM Reasoning
Parallel Reasoning
Codeforces Elo

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.