Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, long

Summary

A systematic empirical study of prompt engineering for formal mathematical reasoning, conducted in the SAIR Equational Theories Stage 1 competition, reports a "single-prompt ceiling" for LLMs. The researchers tested more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three models: gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. The core finding is that balanced hard accuracy for gpt-oss-120b plateaus in the 60–79% range, at best roughly 19.5 percentage points above the 59.75% no-cheatsheet baseline, and that further engineering yields only unstable, non-generalizable improvements. The saturation is attributed to three factors: the undecidability of "True" cases, performance collapse in weaker models given complex prompts (Llama 3.3 70B drops to 0% True recall with prompts larger than 2 KB), and fragile prompt-ordering effects. One design decision, placing the trivial-magma check before the counterexample table, accounted for the primary performance gain.

Key takeaway

AI engineers building LLM solutions for formal mathematical reasoning should recognize that a "single-prompt ceiling" limits performance gains. Prioritize concise, well-ordered prompts over complex ones, especially when targeting weaker models or multi-model deployment, because excessive complexity can degrade accuracy and generalization. Validate prompt designs against diverse, balanced problem distributions to avoid the catastrophic failures observed under distribution mismatch.

Key insights

LLM mathematical reasoning hits a "single-prompt ceiling" beyond which added prompt complexity yields diminishing, unstable returns.

Principles

The more complex a prompt is, the worse it tends to generalize across models.
Distribution mismatch can lead to catastrophic performance failures.
Prompt ordering significantly impacts model attention and performance.

Method

Systematic ablation study of 40+ prompt variants on labeled dataset splits, analyzing performance across gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B for equational implication tasks.
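
For readers unfamiliar with equational implication tasks, the sketch below illustrates what a "False" case can look like: a finite magma that satisfies the premise equation but violates the conclusion refutes the implication. This is a minimal illustration, not the paper's pipeline. The helper names (`satisfies`, `refutes`), the example equations, and the two tables are assumptions chosen for illustration and are not taken from the competition data or the paper's counterexample table.

```python
from itertools import product

def satisfies(table, lhs, rhs, n_vars):
    """True if lhs == rhs under every variable assignment in the magma.

    `table[a][b]` is the product a*b; elements are 0..len(table)-1.
    """
    def op(a, b):
        return table[a][b]
    return all(
        lhs(op, *vals) == rhs(op, *vals)
        for vals in product(range(len(table)), repeat=n_vars)
    )

def refutes(table, premise, conclusion):
    """True if the magma satisfies the premise but not the conclusion."""
    return satisfies(table, *premise) and not satisfies(table, *conclusion)

# Illustrative equations (hypothetical, not from the competition set):
# commutativity x*y = y*x and associativity (x*y)*z = x*(y*z).
commutative = (lambda op, x, y: op(x, y), lambda op, x, y: op(y, x), 2)
associative = (
    lambda op, x, y, z: op(op(x, y), z),
    lambda op, x, y, z: op(x, op(y, z)),
    3,
)

NAND = [[1, 1], [1, 0]]  # two-element magma: commutative, not associative
TRIVIAL = [[0]]          # one-element magma: satisfies every equation

print(refutes(NAND, commutative, associative))     # True
print(refutes(TRIVIAL, commutative, associative))  # False
```

The one-element magma satisfies every equation, so it can never serve as a counterexample. Refuting an implication always requires a magma with at least two elements.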

In practice

Prioritize minimal effective prompts for multi-model deployment.
Validate prompts on balanced datasets, not just False-heavy subsets.
Consider placing critical instructions at the beginning or end of prompts.
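
The balanced-validation advice can be made concrete with a small sketch. It assumes binary True/False labels and uses synthetic data, not results from the paper. It shows how a model that always answers False looks strong on a False-heavy split while its True recall is 0% and its balanced accuracy is only 50%. The function names are hypothetical.

```python
def recall(labels, preds, cls):
    """Fraction of examples with true label `cls` that were predicted as `cls`."""
    relevant = [p for l, p in zip(labels, preds) if l == cls]
    if not relevant:
        raise ValueError(f"no examples with label {cls!r}")
    return sum(p == cls for p in relevant) / len(relevant)

def accuracy(labels, preds):
    """Plain accuracy: fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def balanced_accuracy(labels, preds):
    """Mean of True recall and False recall, so each class counts equally."""
    return (recall(labels, preds, True) + recall(labels, preds, False)) / 2

# Synthetic False-heavy split: 90 False cases, 10 True cases.
labels = [False] * 90 + [True] * 10
# A degenerate model that has collapsed to always answering False.
preds = [False] * 100

print(accuracy(labels, preds))           # 0.9
print(recall(labels, preds, True))       # 0.0
print(balanced_accuracy(labels, preds))  # 0.5
```

Reporting True recall alongside balanced accuracy makes a collapse of this kind visible even when raw accuracy on a skewed split looks healthy.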

Topics

LLM Formal Reasoning
Prompt Engineering
Equational Theories
Single-Prompt Ceiling
Cognitive Load Collapse

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.