Feedback request: Testing the $H_{dp}$ bandwidth bound on LLM benchmarks (Preprint check & review)

2026-05-29 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A preprint empirically tests the $H_{dp}$ bandwidth bound, which posits that Chain-of-Thought (CoT) benefits only tasks whose depth exceeds a transformer's single-pass capacity. The study compares direct-answer and 2048-token CoT conditions across Qwen-2.5 (7B/32B) and Llama-3.1-8B on several benchmarks. For high-depth P-complete tasks like GSM8K and MATH, CoT is essential, yielding accuracy gains of +54 to +68 pp. For shallow TC$^0$ tasks (MMLU, ARC), CoT changes accuracy only marginally (0.0 to +4.6 pp). Intermediate L-class tasks like HumanEval show a sharp capacity transition: the initially reported figures were a +68.9 pp gain for Qwen-32B and a -27.4 pp penalty for Qwen-7B. A V3 correction revised the HumanEval scores to a +23.2 pp gain for 32B and a -28.7 pp penalty for 7B. The direction of the effect is unchanged, which the preprint takes as support for reading CoT as an architectural bandwidth bypass.

Key takeaway

For ML engineers optimizing LLM inference: Chain-of-Thought (CoT) is not universally beneficial. On shallow (TC$^0$) tasks, CoT adds little (0.0 to +4.6 pp in this study), and on intermediate tasks like HumanEval it can degrade accuracy on smaller models. Prefer direct answers for simple tasks and reserve CoT for genuinely high-depth problems where single-pass capacity is the bottleneck, especially with larger models. This avoids unnecessary token generation and potential accuracy drops.

Key insights

The $H_{dp}$ bandwidth bound predicts CoT benefits only when task depth exceeds a transformer's single-pass capacity.

Principles

CoT acts as an architectural bandwidth bypass, not a universal enhancer.
Task computational depth dictates the utility of Chain-of-Thought.
Smaller LLMs can suffer performance penalties from CoT on intermediate tasks.

Method

Empirical testing of the $H_{dp}$ bandwidth bound by comparing direct-answer vs. 2048-token CoT conditions across different task depths (P-complete, TC$^0$, L-class) using Qwen-2.5 and Llama-3.1-8B.
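
The comparison described above reduces to a percentage-point (pp) difference between two accuracy measurements on the same benchmark. The following is a minimal sketch of that arithmetic, not the preprint's evaluation code: the names `ConditionResult` and `cot_delta_pp` are hypothetical, and the counts are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionResult:
    """Outcome of one prompting condition on one benchmark (hypothetical type)."""
    correct: int
    total: int

    @property
    def accuracy_pct(self) -> float:
        # Accuracy as a percentage in [0, 100].
        if self.total <= 0:
            raise ValueError("total must be positive")
        return 100.0 * self.correct / self.total


def cot_delta_pp(direct: ConditionResult, cot: ConditionResult) -> float:
    """Percentage-point change when moving from direct-answer to CoT."""
    return cot.accuracy_pct - direct.accuracy_pct


# Made-up counts for illustration only; these are NOT figures from the preprint.
direct = ConditionResult(correct=30, total=100)
cot = ConditionResult(correct=80, total=100)
delta = cot_delta_pp(direct, cot)
print(f"{delta:+.1f} pp")
```

A positive value means CoT helped and a negative value means it hurt, which matches the sign convention of the figures quoted in the summary.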

In practice

Avoid CoT for shallow TC$^0$ tasks like MMLU.
Employ CoT for high-depth P-complete tasks such as GSM8K.
Carefully evaluate CoT for intermediate L-class tasks like HumanEval.
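
The three recommendations above can be read as a simple routing rule. The sketch below is one possible encoding under stated assumptions, not a method from the preprint: `DepthClass`, `should_use_cot`, and the `intermediate_min_params_b` threshold are hypothetical. The summary reports only two data points for intermediate tasks (7B hurt, 32B helped), so the default cutoff of 32 is a placeholder to be validated on your own workload.

```python
from enum import Enum


class DepthClass(Enum):
    """Task depth classes as named in the summary (hypothetical enum)."""
    SHALLOW = "TC0"          # e.g. MMLU, ARC
    INTERMEDIATE = "L"       # e.g. HumanEval
    DEEP = "P-complete"      # e.g. GSM8K, MATH


def should_use_cot(
    depth: DepthClass,
    model_params_b: float,
    intermediate_min_params_b: float = 32.0,  # assumed cutoff, not from the preprint
) -> bool:
    """Decide whether to prompt with CoT for a given task depth and model size."""
    if depth is DepthClass.DEEP:
        return True   # CoT reported as essential for high-depth tasks
    if depth is DepthClass.SHALLOW:
        return False  # CoT reported as giving negligible change
    # Intermediate tasks: only larger models are reported to benefit.
    return model_params_b >= intermediate_min_params_b


print(should_use_cot(DepthClass.INTERMEDIATE, model_params_b=7))   # False
print(should_use_cot(DepthClass.INTERMEDIATE, model_params_b=32))  # True
```

Keeping the cutoff as a parameter reflects the "carefully evaluate" advice: the actual transition point lies somewhere between the two model sizes tested and is not specified in the summary.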

Topics

LLM Benchmarking
Chain-of-Thought
Transformer Architecture
Model Capacity
Qwen-2.5
Llama-3.1

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.