Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

2026-04-20 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A systematic empirical study investigated prompt engineering for formal mathematical reasoning in the SAIR Equational Theories Stage 1 competition, a task that requires deciding whether one equational law implies another over magmas. The researchers designed and tested over 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models: gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. The core finding is a "single-prompt ceiling": balanced hard accuracy for gpt-oss-120b plateaued between 60% and 79%, compared with 59.75% for the no-cheatsheet baseline. The study attributes this ceiling to three factors: the mathematical undecidability of the TRUE case, performance degradation in weaker models (Llama 3.3 70B) when prompts exceed 2KB, and fragile prompt ordering effects. The top submission, AN45c (2,252 bytes), achieved 79.25% accuracy on hard3, with 95.9% TRUE recall and 63.4% FALSE recall.
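
The summary reports balanced accuracy alongside separate TRUE and FALSE recall. The minimal sketch below shows how those quantities relate, assuming the standard definition of balanced accuracy as the mean of the two per-class recalls; the source does not spell out its exact formula, and the labels and predictions here are toy data, not results from the study.

```python
def recall(labels, preds, cls):
    """Fraction of examples whose true label is `cls` that were predicted as `cls`."""
    relevant = [p for y, p in zip(labels, preds) if y == cls]
    if not relevant:
        raise ValueError(f"no examples with label {cls!r}")
    return sum(p == cls for p in relevant) / len(relevant)


def balanced_accuracy(labels, preds):
    """Mean of TRUE recall and FALSE recall (assumed definition, see lead-in)."""
    return (recall(labels, preds, True) + recall(labels, preds, False)) / 2


# Toy data (illustrative only): 4 TRUE implications and 4 FALSE ones.
labels = [True, True, True, True, False, False, False, False]
# A model that leans toward answering TRUE.
preds = [True, True, True, True, True, True, False, False]

true_recall = recall(labels, preds, True)    # all 4 TRUE cases found
false_recall = recall(labels, preds, False)  # only 2 of 4 FALSE cases found
score = balanced_accuracy(labels, preds)
print(true_recall, false_recall, score)
```

Under this definition, a model that always answers TRUE scores only 50%, which is why a high TRUE recall paired with a lower FALSE recall still caps the balanced score.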

Key takeaway

AI engineers building LLM-based mathematical reasoning systems should expect diminishing returns from prompt engineering, because both inherent mathematical limits and model sensitivities cap what a single prompt can achieve. Prioritize prompt efficiency and test for performance degradation as prompt size grows, especially when deploying on smaller models like Llama 3.3 70B, to avoid significant drops in recall for specific logical outcomes.

Key insights

LLM mathematical reasoning accuracy plateaus despite extensive prompt engineering, limited by undecidability and prompt complexity.

Principles

Undecidability of the TRUE case limits what a single prompt can encode.
Prompts over 2KB degrade performance in weaker models such as Llama 3.3 70B.
Prompt ordering effects are fragile and non-monotonic.

Method

The study involved designing and testing over 40 prompt variants (0-4,878 bytes) across four evaluation splits and three LLMs (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B) for equational reasoning.
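
An evaluation grid of this shape (prompt variants by splits by models) can be sketched as follows. This is an illustration under stated assumptions, not the authors' harness: `run_grid`, `placeholder_evaluate`, the prompt names, and the split names other than hard3 are all hypothetical, and the placeholder scorer stands in for a real model call.

```python
from itertools import product


def run_grid(prompts, splits, models, evaluate):
    """Score every (prompt, split, model) combination.

    `prompts` maps a variant name to its text; `evaluate` is a caller-supplied
    function (prompt_text, split, model) -> score. All names are hypothetical.
    """
    results = {}
    for (name, text), split, model in product(prompts.items(), splits, models):
        results[(name, split, model)] = evaluate(text, split, model)
    return results


def placeholder_evaluate(prompt_text, split, model):
    """Stand-in scorer: returns the prompt size in bytes instead of calling a model."""
    return len(prompt_text.encode("utf-8"))


# Hypothetical variants; the empty prompt mirrors the 0-byte no-cheatsheet baseline.
prompts = {"baseline": "", "short_cheatsheet": "Check small counterexamples first."}
splits = ["hard3", "split_b"]  # "split_b" is a made-up name
models = ["gpt-oss-120b", "Llama 3.3 70B", "Gemma 4 31B"]

grid = run_grid(prompts, splits, models, placeholder_evaluate)
print(len(grid))
```

Keeping the scorer as a parameter means the same loop can be reused when the real model call or metric changes.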

In practice

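Keep prompts compact for smaller LLMs; the study saw degradation in Llama 3.3 70B with prompts over 2KB.
Test alternative orderings of prompt sections, since ordering effects are fragile and non-monotonic.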
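
Both recommendations can be checked mechanically. The sketch below enumerates every ordering of a prompt's sections, measures each variant in bytes, and ranks the variants with a caller-supplied scorer. The section texts, the function names, the toy scorer, and the 2,048-byte reading of "2KB" are assumptions made for illustration, not details from the study.

```python
from itertools import permutations


def byte_size(prompt):
    """Prompt size in UTF-8 bytes, the unit used for prompt lengths in the summary."""
    return len(prompt.encode("utf-8"))


def ordering_variants(sections, separator="\n\n"):
    """Every ordering of the sections, joined into one prompt (n! variants)."""
    return [separator.join(order) for order in permutations(sections)]


def score_orderings(sections, score_fn, max_bytes=2048):
    """Score each ordering and flag variants above the (assumed) byte budget."""
    report = []
    for prompt in ordering_variants(sections):
        size = byte_size(prompt)
        report.append({
            "prompt": prompt,
            "bytes": size,
            "over_budget": size > max_bytes,
            "score": score_fn(prompt),
        })
    # Best-scoring ordering first.
    return sorted(report, key=lambda row: row["score"], reverse=True)


# Hypothetical prompt sections.
sections = ["RULES: prefer counterexamples.", "EXAMPLES: x*y = y*x.", "FORMAT: answer TRUE or FALSE."]


def toy_score(prompt):
    """Toy scorer standing in for a real evaluation run."""
    return 1.0 if prompt.startswith("RULES") else 0.0


report = score_orderings(sections, toy_score)
print(report[0]["prompt"].splitlines()[0], report[0]["bytes"])
```

Because the number of orderings grows factorially with the number of sections, exhaustive testing is only practical for a handful of sections; larger prompts would need sampling.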

Topics

Prompt Engineering
LLM Mathematical Reasoning
SAIR Equational Theories
Single-Prompt Ceiling
Model Performance Evaluation

Code references

israelcazares/sair-prompt-engineering

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.