Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Summary
A systematic empirical study of prompt engineering for formal mathematical reasoning, conducted for the SAIR Equational Theories Stage 1 competition, finds a "single-prompt ceiling" for LLMs. The researchers tested more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three models: gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. The core finding is that balanced hard accuracy for gpt-oss-120b plateaus at 60–79%; the upper end corresponds to a +19.5 percentage-point improvement over the 59.75% no-cheatsheet baseline, and further engineering yields only unstable, non-generalizable gains. This saturation is attributed to three factors: the undecidability of "True" cases, performance collapse in weaker models given complex prompts (Llama 3.3 70B drops to 0% True recall with prompts larger than 2 KB), and fragile prompt-ordering effects. One design decision, placing the trivial-magma check before the counterexample table, accounted for the primary performance gain.
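The asymmetry between "False" and "True" cases can be illustrated with a small sketch. A "False" implication can be certified by exhibiting one finite magma that satisfies the premise but violates the conclusion, whereas failing to find one in a bounded search proves nothing. The code below is not the competition's or the study's tooling: the names `magmas`, `holds`, and `find_counterexample`, the example laws, and the size bound are all illustrative assumptions.

```python
from itertools import product

def magmas(n):
    """Yield every binary operation on {0, ..., n-1} as an n x n table."""
    for flat in product(range(n), repeat=n * n):
        yield tuple(tuple(flat[i * n:(i + 1) * n]) for i in range(n))

def holds(law, table):
    """Check a law, given as (predicate, number_of_variables), on one magma."""
    predicate, arity = law
    op = lambda a, b: table[a][b]
    return all(predicate(op, *vals)
               for vals in product(range(len(table)), repeat=arity))

def find_counterexample(premise, conclusion, max_size=2):
    """Return a magma satisfying premise but not conclusion, or None."""
    for n in range(1, max_size + 1):
        for table in magmas(n):
            if holds(premise, table) and not holds(conclusion, table):
                return table
    return None  # inconclusive: no counterexample up to max_size

# Illustrative laws (assumed for this example, not taken from the study).
COMMUTATIVE = (lambda op, x, y: op(x, y) == op(y, x), 2)
ASSOCIATIVE = (lambda op, x, y, z: op(op(x, y), z) == op(x, op(y, z)), 3)

witness = find_counterexample(COMMUTATIVE, ASSOCIATIVE)
```

Here `witness` is a 2-element operation table certifying that commutativity does not imply associativity. A `None` result only means that no counterexample exists up to `max_size`, which is why "True" answers cannot be settled by a table lookup alone.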
Key takeaway
AI engineers building LLM solutions for formal mathematical reasoning should recognize that a "single-prompt ceiling" limits performance gains. Prioritize concise, well-ordered prompts over complex ones, especially when targeting weaker models or multi-model deployment, because excessive complexity can degrade accuracy and generalization. Validate prompt designs against diverse, balanced problem distributions to avoid the catastrophic failures observed under distribution mismatch.
Key insights
LLM mathematical reasoning hits a "single-prompt ceiling" beyond which added prompt complexity yields diminishing, unstable returns.
Principles
- The more complex a prompt, the worse it generalizes across models.
- Distribution mismatch can lead to catastrophic performance failures.
- Prompt ordering significantly impacts model attention and performance.
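One way to probe ordering sensitivity is to generate every ordering of a prompt's sections and evaluate each as a separate variant. The sketch below assumes a prompt built from three named sections; the section names and text are placeholders, not the study's actual cheatsheet content.

```python
from itertools import permutations

# Placeholder sections; the real cheatsheet text is not reproduced here.
SECTIONS = {
    "trivial_magma_check": "Step A: check the trivial magma case.",
    "counterexample_table": "Step B: consult the counterexample table.",
    "answer_format": "Reply with TRUE or FALSE.",
}

def ordering_variants(sections):
    """Map each ordering of section names to its assembled prompt text."""
    return {
        order: "\n\n".join(sections[name] for name in order)
        for order in permutations(sections)
    }

variants = ordering_variants(SECTIONS)
for order, prompt in variants.items():
    # Byte length is reported because the study sizes prompts in bytes.
    print(" > ".join(order), len(prompt.encode("utf-8")), "bytes")
```

Every variant has the same byte length, so any accuracy difference between them would be attributable to ordering rather than to prompt size.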
Method
Systematic ablation study of 40+ prompt variants on labeled dataset splits, analyzing performance across gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B for equational implication tasks.
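An ablation of this shape reduces to a grid over models, prompt variants, and splits. The following is a minimal sketch under stated assumptions: `query_model` is a hypothetical stand-in for a real inference call, and the variant names, split contents, and data layout are invented for illustration.

```python
from itertools import product

def query_model(model, prompt, problem):
    """Hypothetical stub for an LLM call; always answers False here."""
    return False

def run_grid(models, variants, splits, ask=query_model):
    """Return accuracy for every (model, variant, split) combination."""
    results = {}
    for model, (vname, prompt), (sname, items) in product(
            models, variants.items(), splits.items()):
        correct = sum(ask(model, prompt, problem) == label
                      for problem, label in items)
        results[(model, vname, sname)] = correct / len(items)
    return results

# Toy inputs (assumed); real splits would hold labeled implication problems.
models = ["gpt-oss-120b", "Llama 3.3 70B", "Gemma 4 31B"]
variants = {"no_cheatsheet": "", "short_cheatsheet": "placeholder text"}
splits = {"toy_balanced": [("p1", True), ("p2", False)]}

results = run_grid(models, variants, splits)
```

Keeping results keyed by `(model, variant, split)` makes it straightforward to check whether a gain on one model or split carries over to the others.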
In practice
- Prioritize minimal effective prompts for multi-model deployment.
- Validate prompts on balanced datasets, not just False-heavy subsets.
- Consider placing critical instructions at the beginning or end of prompts.
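The point about balanced validation can be made concrete with a small calculation. A predictor that always answers False looks strong on a False-heavy set while having zero True recall. Balanced accuracy is taken here as the mean of True recall and False recall, which is an assumption and may not match the study's exact metric. The label counts are made up for illustration and are not the study's data.

```python
def recalls(labels, preds):
    """Return (true_recall, false_recall); assumes both classes are present."""
    tp = sum(l and p for l, p in zip(labels, preds))
    tn = sum((not l) and (not p) for l, p in zip(labels, preds))
    n_true = sum(labels)
    n_false = len(labels) - n_true
    return tp / n_true, tn / n_false

def balanced_accuracy(labels, preds):
    """Mean of True recall and False recall."""
    true_recall, false_recall = recalls(labels, preds)
    return (true_recall + false_recall) / 2

# Illustrative False-heavy split: 10 True problems, 90 False problems.
labels = [True] * 10 + [False] * 90
always_false = [False] * len(labels)

raw_accuracy = sum(l == p for l, p in zip(labels, always_false)) / len(labels)
print(raw_accuracy)                             # 0.9
print(recalls(labels, always_false)[0])         # 0.0 (True recall)
print(balanced_accuracy(labels, always_false))  # 0.5
```

On this toy split, raw accuracy of 0.9 hides the fact that no True case is ever recognized, which is the failure mode a balanced evaluation is meant to surface.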
Topics
- LLM Formal Reasoning
- Prompt Engineering
- Equational Theories
- Single-Prompt Ceiling
- Cognitive Load Collapse
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.