Feedback request: Testing the $H_{dp}$ bandwidth bound on LLM benchmarks (Preprint check & review)
Summary
A preprint empirically tests the $H_{dp}$ bandwidth bound, which posits that Chain-of-Thought (CoT) helps only on tasks whose depth exceeds a transformer's single-pass capacity. The study compares direct-answer and 2048-token CoT conditions across Qwen-2.5 (7B/32B) and Llama-3.1-8B on several benchmarks. On high-depth P-complete tasks like GSM8K and MATH, CoT is essential, yielding accuracy gains of +54 to +68 percentage points (pp). On shallow TC$^0$ tasks (MMLU, ARC), CoT changes accuracy only slightly (0.0 to +4.6 pp). Intermediate L-class tasks like HumanEval reveal a sharp capacity transition: the initially reported figures were a +68.9 pp gain for Qwen-32B and a -27.4 pp penalty for Qwen-7B. A V3 correction revised the HumanEval figures to +23.2 pp for 32B and -28.7 pp for 7B. The direction of both effects is unchanged, which the preprint takes as reinforcing the view of CoT as an architectural bandwidth bypass.
Key takeaway
For ML engineers optimizing LLM inference: CoT is not universally beneficial. On shallow (TC$^0$) tasks it adds little or no accuracy, and on intermediate tasks like HumanEval it can degrade performance on smaller models. Prefer direct answers for shallow tasks and reserve CoT for genuinely high-depth problems where single-pass capacity is the bottleneck, especially with larger models. This avoids unnecessary token generation and potential accuracy drops.
Key insights
The $H_{dp}$ bandwidth bound predicts CoT benefits only when task depth exceeds a transformer's single-pass capacity.
Principles
- CoT acts as an architectural bandwidth bypass, not a universal enhancer.
- Task computational depth dictates the utility of Chain-of-Thought.
- Smaller LLMs can suffer performance penalties from CoT on intermediate tasks.
Method
Empirical testing of the $H_{dp}$ bandwidth bound by comparing direct-answer vs. 2048-token CoT conditions across different task depths (P-complete, TC$^0$, L-class) using Qwen-2.5 and Llama-3.1-8B.
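The comparison above comes down to a difference of two accuracies, reported in percentage points (pp). A minimal sketch of that arithmetic follows, assuming each condition yields one pass/fail flag per benchmark item. The function names and the toy data are hypothetical and are not taken from the preprint.

```python
def accuracy_pct(correct_flags):
    """Return accuracy as a percentage (0-100) from per-item pass/fail flags."""
    if not correct_flags:
        raise ValueError("need at least one graded example")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def cot_gain_pp(direct_flags, cot_flags):
    """CoT accuracy minus direct-answer accuracy, in percentage points.

    Positive means CoT helped; negative means CoT hurt.
    """
    return accuracy_pct(cot_flags) - accuracy_pct(direct_flags)


# Toy data (illustrative only): four items graded under each condition.
direct = [True, False, False, False]   # 25% correct with direct answers
cot = [True, True, True, False]        # 75% correct with CoT
gain = cot_gain_pp(direct, cot)
print(f"CoT gain: {gain:+.1f} pp")
```

A gain in pp is an absolute difference between two percentages, not a relative improvement, so it is directly comparable across benchmarks with different baselines.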
In practice
- Avoid CoT for shallow TC$^0$ tasks like MMLU.
- Employ CoT for high-depth P-complete tasks such as GSM8K.
- Carefully evaluate CoT for intermediate L-class tasks like HumanEval.
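The three recommendations above can be sketched as a simple routing rule. This is an illustration under stated assumptions, not a method from the preprint: the `Depth` enum, the `use_cot` function, and the rule "for intermediate tasks, follow the sign of a measured gain" are all hypothetical choices. The two example gains are the V3-corrected HumanEval figures quoted in the summary.

```python
from enum import Enum


class Depth(Enum):
    """Hypothetical labels for the three task-depth classes in the summary."""
    SHALLOW = "TC0"          # e.g. MMLU, ARC
    INTERMEDIATE = "L"       # e.g. HumanEval
    DEEP = "P-complete"      # e.g. GSM8K, MATH


def use_cot(depth, measured_gain_pp=None):
    """Decide whether to prompt with CoT for a task of the given depth.

    For intermediate tasks the answer depends on the model, so this sketch
    requires a gain measured on that model (in percentage points).
    """
    if depth is Depth.SHALLOW:
        return False
    if depth is Depth.DEEP:
        return True
    if measured_gain_pp is None:
        raise ValueError("intermediate tasks need a measured CoT gain for this model")
    return measured_gain_pp > 0


# V3-corrected HumanEval gains from the summary: +23.2 pp (32B), -28.7 pp (7B).
print(use_cot(Depth.INTERMEDIATE, 23.2))    # larger model: CoT on
print(use_cot(Depth.INTERMEDIATE, -28.7))   # smaller model: CoT off
```

Requiring a measurement for the intermediate class, rather than guessing from model size, reflects the summary's point that the sign of the effect flips between the 7B and 32B models.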
Topics
- LLM Benchmarking
- Chain-of-Thought
- Transformer Architecture
- Model Capacity
- Qwen-2.5
- Llama-3.1
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.