Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A systematic empirical study examined prompt engineering for formal mathematical reasoning in the SAIR Equational Theories Stage 1 competition, where the task is to decide whether one equational law implies another over magmas. The researchers designed and tested more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models: gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. The core finding is a "single-prompt ceiling": balanced hard accuracy for gpt-oss-120b plateaued between 60% and 79%, a modest improvement over the 59.75% no-cheatsheet baseline. The study attributes this ceiling to three factors: the mathematical undecidability of the TRUE case, performance degradation in weaker models (Llama 3.3 70B) once prompts exceed 2KB, and fragile prompt-ordering effects. The top submission, AN45c (2,252 bytes), achieved 79.25% accuracy on hard3, with 95.9% TRUE recall and 63.4% FALSE recall.
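
The per-class figures quoted in the summary (TRUE recall, FALSE recall) can be computed from gold labels and predictions as in the sketch below. It is only an illustration: the function name `recall_report` and the toy labels are hypothetical, and the "balanced accuracy" field assumes the common definition (mean of the two recalls), which the summary does not spell out.

```python
from collections import Counter

def recall_report(gold, pred):
    """Return TRUE recall, FALSE recall, balanced accuracy, and plain accuracy."""
    counts = Counter(zip(gold, pred))          # (gold, pred) -> count
    n_true = sum(1 for g in gold if g)
    n_false = len(gold) - n_true
    true_recall = counts[(True, True)] / n_true if n_true else float("nan")
    false_recall = counts[(False, False)] / n_false if n_false else float("nan")
    return {
        "true_recall": true_recall,
        "false_recall": false_recall,
        # Assumed definition: unweighted mean of the two per-class recalls.
        "balanced_accuracy": (true_recall + false_recall) / 2,
        "accuracy": (counts[(True, True)] + counts[(False, False)]) / len(gold),
    }

# Toy data (not from the study): every TRUE item is caught, half the FALSE items are missed.
gold = [True, True, True, True, False, False, False, False]
pred = [True, True, True, True, True, False, False, True]
report = recall_report(gold, pred)
```

Reporting the two recalls separately, as the summary does, exposes a model that answers TRUE too readily even when its overall accuracy looks acceptable.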

Key takeaway

AI engineers building LLM-based mathematical reasoning systems should recognize that prompt engineering has diminishing returns, because of inherent mathematical limits and model sensitivities. Prioritize prompt efficiency and test for performance degradation as prompt size grows, especially when deploying on weaker models such as Llama 3.3 70B, to avoid significant drops in recall for one of the two answer classes (TRUE or FALSE).
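
One way to act on this advice is to sweep prompt variants by byte size and track each class's recall separately. The sketch below is a minimal harness under stated assumptions: `sweep_prompt_sizes` and `ask_model` are hypothetical names, `ask_model(prompt, question)` stands for whatever model call you use and is assumed to return a boolean verdict, and `fake_model` is a stand-in that simulates a model collapsing to TRUE on long prompts, not a real measurement.

```python
def sweep_prompt_sizes(variants, ask_model, items):
    """Evaluate each prompt variant; return rows sorted by prompt size in bytes."""
    rows = []
    for name, prompt in variants.items():
        hits = {True: 0, False: 0}
        totals = {True: 0, False: 0}
        for question, gold in items:
            totals[gold] += 1
            if ask_model(prompt, question) == gold:
                hits[gold] += 1
        rows.append({
            "variant": name,
            "bytes": len(prompt.encode("utf-8")),
            "true_recall": hits[True] / totals[True] if totals[True] else None,
            "false_recall": hits[False] / totals[False] if totals[False] else None,
        })
    return sorted(rows, key=lambda r: r["bytes"])

# Stand-in model for illustration only: correct on short prompts,
# answers TRUE to everything once the prompt exceeds 2048 bytes.
ANSWERS = {"q1": True, "q2": False}

def fake_model(prompt, question):
    if len(prompt.encode("utf-8")) > 2048:
        return True
    return ANSWERS[question]

variants = {"long": "x" * 3000, "empty": ""}
items = [("q1", True), ("q2", False)]
rows = sweep_prompt_sizes(variants, fake_model, items)
```

Sorting by byte count makes a size-related drop in one class's recall easy to spot when the rows are printed or plotted.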

Key insights

LLM mathematical reasoning accuracy plateaus despite extensive prompt engineering, limited by undecidability and prompt complexity.

Method

The researchers designed and tested more than 40 prompt variants (0 to 4,878 bytes) across four evaluation splits and three LLMs (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B) on the equational reasoning task.
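
To make the underlying task concrete, the sketch below brute-forces small finite magmas looking for a counterexample to an implication between two laws. It is not the study's method. The helper names are hypothetical, and commutativity and associativity are chosen here only as familiar example laws. A magma is just a set with one binary operation, so a table of size n has n² entries.

```python
from itertools import product

def all_magmas(n):
    """Yield every binary operation on {0, ..., n-1} as a lookup table."""
    cells = list(product(range(n), repeat=2))
    for values in product(range(n), repeat=len(cells)):
        yield dict(zip(cells, values))

def holds(law, arity, table, n):
    """Check a law for every assignment of its variables in the magma."""
    op = lambda a, b: table[(a, b)]
    return all(law(op, *xs) for xs in product(range(n), repeat=arity))

def find_counterexample(premise, conclusion, max_size=2):
    """Return (size, table) where premise holds and conclusion fails, else None."""
    for n in range(1, max_size + 1):
        for table in all_magmas(n):
            if holds(*premise, table, n) and not holds(*conclusion, table, n):
                return n, table
    return None

# Each law is a (predicate, number_of_variables) pair.
commutative = (lambda op, x, y: op(x, y) == op(y, x), 2)
associative = (lambda op, x, y, z: op(op(x, y), z) == op(x, op(y, z)), 3)

# Does commutativity imply associativity? A witness shows the answer is FALSE.
witness = find_counterexample(commutative, associative)
```

A returned witness settles the FALSE case, but a `None` result proves nothing: the search only covers magmas up to `max_size`. That asymmetry is one way to read the summary's point about the TRUE case being the hard one.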

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.