Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Summary
A systematic empirical study investigated prompt engineering for formal mathematical reasoning in the SAIR Equational Theories Stage 1 competition, where the task is to decide whether one equational law implies another over magmas. The researchers designed and tested over 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models: gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. The core finding is a "single-prompt ceiling": balanced hard accuracy for gpt-oss-120b plateaued in the 60-79% range, against a no-cheatsheet baseline of 59.75%. The study attributes this ceiling to three factors: the mathematical undecidability of the TRUE case, performance degradation in weaker models (Llama 3.3 70B) once prompts exceed 2KB, and fragile prompt ordering effects. The top submission, AN45c (2,252 bytes), achieved 79.25% accuracy on hard3, with 95.9% TRUE recall and 63.4% FALSE recall.
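Because the headline numbers mix an overall score with per-class recall, it helps to see how the two relate. The following is a minimal sketch, assuming the usual definition of balanced accuracy as the mean of TRUE recall and FALSE recall. The `Counts` type and the counts in the usage note are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-class tallies for a TRUE/FALSE implication benchmark."""
    true_correct: int   # TRUE items the model labelled TRUE
    true_total: int     # all TRUE items in the split
    false_correct: int  # FALSE items the model labelled FALSE
    false_total: int    # all FALSE items in the split


def true_recall(c: Counts) -> float:
    return c.true_correct / c.true_total


def false_recall(c: Counts) -> float:
    return c.false_correct / c.false_total


def balanced_accuracy(c: Counts) -> float:
    # Mean of per-class recall: each class counts equally,
    # regardless of how many items it has in the split.
    return (true_recall(c) + false_recall(c)) / 2


def plain_accuracy(c: Counts) -> float:
    # Fraction of all items answered correctly; this equals
    # balanced accuracy only when the split is exactly 50/50.
    return (c.true_correct + c.false_correct) / (c.true_total + c.false_total)
```

For example, `balanced_accuracy(Counts(96, 100, 63, 100))` averages a high TRUE recall with a much lower FALSE recall, which is the same shape of imbalance the summary reports for the top submission.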
Key takeaway
AI engineers building LLM-based mathematical reasoning systems should expect diminishing returns from prompt engineering, because accuracy is bounded by inherent mathematical limits and by model sensitivity to prompt size and ordering. Prioritize prompt efficiency and measure how performance changes as the prompt grows, especially when deploying on smaller models like Llama 3.3 70B, where long prompts can sharply reduce recall for one of the two logical outcomes.
Key insights
LLM mathematical reasoning accuracy plateaus despite extensive prompt engineering, limited by undecidability and prompt complexity.
Principles
- Undecidability of the TRUE case limits what a prompt can encode.
- Complex prompts degrade weaker model performance.
- Prompt ordering effects are non-monotonic.
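The asymmetry behind the first principle can be illustrated with a small search. A FALSE answer can often be certified by a single finite magma that satisfies the premise law but violates the conclusion, whereas failing to find one proves nothing about TRUE. The sketch below is an illustration only, not the study's method: the term encoding (a variable is a string, a product is a pair of terms) and all function names are assumptions.

```python
from itertools import product

# A term is either a variable name (str) or a pair (left, right) meaning left * right.
# A law is a pair (lhs, rhs). A magma on {0..n-1} is a table with table[a][b] = a * b.


def variables(term):
    if isinstance(term, str):
        return {term}
    return variables(term[0]) | variables(term[1])


def evaluate(term, table, env):
    if isinstance(term, str):
        return env[term]
    return table[evaluate(term[0], table, env)][evaluate(term[1], table, env)]


def satisfies(law, table):
    """True if the law holds in the magma for every assignment of its variables."""
    lhs, rhs = law
    names = sorted(variables(lhs) | variables(rhs))
    for values in product(range(len(table)), repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(lhs, table, env) != evaluate(rhs, table, env):
            return False
    return True


def find_counterexample(premise, conclusion, max_size=3):
    """Search all magmas up to max_size for one refuting premise => conclusion.

    Returns a multiplication table, or None if none was found. None does NOT
    establish that the implication is TRUE; it only means the search failed.
    """
    for n in range(1, max_size + 1):
        for flat in product(range(n), repeat=n * n):
            table = [list(flat[i * n:(i + 1) * n]) for i in range(n)]
            if satisfies(premise, table) and not satisfies(conclusion, table):
                return table
    return None


commutative = (("x", "y"), ("y", "x"))
associative = ((("x", "y"), "z"), ("x", ("y", "z")))

if __name__ == "__main__":
    # Commutativity does not imply associativity; a 2-element magma suffices.
    print(find_counterexample(commutative, associative))
```

The number of tables grows as n^(n^2), so exhaustive search is only practical for very small sizes. That is one reason a bounded search can settle some FALSE cases but can never settle a TRUE one.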
Method
The study involved designing and testing over 40 prompt variants (0-4,878 bytes) across four evaluation splits and three LLMs (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B) for equational reasoning.
In practice
- Avoid overly complex prompts for smaller LLMs.
- Test prompt ordering for non-monotonic effects.
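Both recommendations above can be checked with a small sweep before committing to a prompt. The following is a minimal sketch, assuming the cheatsheet is already split into sections and that you supply your own scoring callable. `score_prompt` is a hypothetical stand-in for whatever evaluation you run (for example, balanced accuracy on a held-out split); it is not part of any library or of the study's code.

```python
from itertools import islice, permutations
from typing import Callable, Iterator, List, Sequence, Tuple


def prompt_variants(
    sections: Sequence[str], max_orderings: int = 6
) -> Iterator[Tuple[str, str]]:
    """Yield (label, prompt) pairs: growing prefixes, then reorderings."""
    # Size sweep: the first k sections, from the empty prompt to the full one.
    for k in range(len(sections) + 1):
        yield f"prefix-{k}", "\n\n".join(sections[:k])
    # Order sweep: all sections, reordered. permutations() emits the identity
    # ordering first, which duplicates the full prefix, so skip it.
    orders = islice(permutations(range(len(sections))), 1, 1 + max_orderings)
    for order in orders:
        label = "order-" + "-".join(str(i) for i in order)
        yield label, "\n\n".join(sections[i] for i in order)


def sweep(
    sections: Sequence[str],
    score_prompt: Callable[[str], float],
    max_orderings: int = 6,
) -> List[Tuple[str, int, float]]:
    """Return (label, size in bytes, score) for every variant."""
    rows = []
    for label, prompt in prompt_variants(sections, max_orderings):
        size = len(prompt.encode("utf-8"))
        rows.append((label, size, score_prompt(prompt)))
    return rows
```

Running the same sweep once per target model makes size-related degradation visible as a drop in score past some byte count. Comparing the `order-*` rows shows how much of an apparent gain comes from section ordering rather than content.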
Topics
- Prompt Engineering
- LLM Mathematical Reasoning
- SAIR Equational Theories
- Single-Prompt Ceiling
- Model Performance Evaluation
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.