Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
Summary
A new study introduces an automatic algorithm for generating numeric-remapping attacks to test large language models' (LLMs) arithmetic reasoning generalization. This method, unlike template-based approaches, derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and transforms questions via LLM-guided deterministic edits. The pipeline, validated stage-wise, aims to assess LLM fragility under small, schema-preserving numeric changes that retain the original reasoning program. Evaluating DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith, the research found significant conditional accuracy drops of 12.16 to 25.82 percentage points on GSM8K. Conversely, MAWPS and MultiArith showed high stability, with attacked accuracies near or above 98%. These results indicate that numeric-remapping robustness is strongly influenced by dataset structure, highlighting GSM8K's sensitivity even when reasoning programs are preserved.
Key takeaway
For machine learning engineers evaluating LLM robustness in arithmetic reasoning, relying solely on standard benchmarks may mask significant fragility. Your models, even those performing well on original problems, can fail on structurally similar variants with small numeric changes. You should integrate automatic numeric-remapping attacks into your evaluation pipelines, especially for models deployed in natural language reasoning contexts, to uncover dataset-dependent vulnerabilities and ensure more trustworthy arithmetic capabilities.
Key insights
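LLMs can show substantial arithmetic reasoning fragility under small, schema-preserving numeric changes: conditional accuracy dropped by 12.16 to 25.82 percentage points on GSM8K, while attacked accuracy on MAWPS and MultiArith stayed near or above 98%.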
Principles
- LLM numeric-remapping robustness depends strongly on dataset structure.
- LLMs can be sensitive to numeric variation even when the underlying reasoning program is unchanged.
Method
The automatic algorithm works in four stages: it derives a problem-specific symbolic representation, generates constrained numeric remappings, recomputes the gold answer, and rewrites the question by applying an LLM-generated edit plan as deterministic edits. Each stage is validated separately.
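The stages above can be illustrated with a minimal Python sketch. This is not the study's implementation: the `Problem` class, the `{name}` slot template, the search window `max_delta`, and the example word problem are all hypothetical, and plain template filling stands in for the LLM-generated edit plan described in the study. The sketch only shows the core idea of changing the numbers while keeping the reasoning program, enforcing constraints, and recomputing the gold answer.

```python
import random
from dataclasses import dataclass, replace
from itertools import product
from typing import Callable, Dict, List


@dataclass
class Problem:
    template: str                    # question text with {name} slots (hypothetical format)
    values: Dict[str, int]           # numbers that appear in the question
    program: Callable[..., int]      # symbolic reasoning program over those numbers
    constraint: Callable[..., bool]  # conditions every remapping must satisfy


def question(p: Problem) -> str:
    """Render the question text for the current numbers."""
    return p.template.format(**p.values)


def gold_answer(p: Problem) -> int:
    """Recompute the gold answer by running the reasoning program."""
    return p.program(**p.values)


def valid_remappings(p: Problem, max_delta: int = 2) -> List[Dict[str, int]]:
    """Enumerate nearby positive value assignments that satisfy the constraint."""
    names = list(p.values)
    ranges = [
        range(max(1, p.values[n] - max_delta), p.values[n] + max_delta + 1)
        for n in names
    ]
    found = []
    for combo in product(*ranges):
        candidate = dict(zip(names, combo))
        if candidate != p.values and p.constraint(**candidate):
            found.append(candidate)
    return found


def attack(p: Problem, rng: random.Random, max_delta: int = 2) -> Problem:
    """Return a variant with remapped numbers and the same reasoning program."""
    candidates = valid_remappings(p, max_delta)
    if not candidates:
        raise ValueError("no valid remapping in the search window")
    return replace(p, values=rng.choice(candidates))


# Hypothetical example problem: subtract, then divide evenly.
original = Problem(
    template=(
        "Sam has {apples} apples and gives away {given}. He splits the rest "
        "equally among {friends} friends. How many apples does each friend get?"
    ),
    values={"apples": 14, "given": 2, "friends": 4},
    program=lambda apples, given, friends: (apples - given) // friends,
    constraint=lambda apples, given, friends: (
        apples > given and (apples - given) % friends == 0
    ),
)

attacked = attack(original, random.Random(0))
print(question(original), "->", gold_answer(original))
print(question(attacked), "->", gold_answer(attacked))
```

The constraint keeps each variant well-posed (a positive remainder that divides evenly), so a wrong model answer on the variant reflects sensitivity to the numbers and not a broken question.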
In practice
- Use numeric-remapping attacks to stress-test LLM arithmetic.
- Identify dataset structures that expose LLM reasoning fragility.
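To compare fragility across datasets in your own pipeline, one option is to score attacked variants only for problems the model solved in their original form. The sketch below assumes that reading of "conditional accuracy"; the summary does not give the study's exact definition, and the function names and sample outcomes are hypothetical.

```python
from typing import Sequence


def conditional_attacked_accuracy(
    original_correct: Sequence[bool], attacked_correct: Sequence[bool]
) -> float:
    """Accuracy (in percent) on attacked variants, restricted to problems
    the model answered correctly in their original form (assumed definition)."""
    if len(original_correct) != len(attacked_correct):
        raise ValueError("both sequences must cover the same problems")
    kept = [a for o, a in zip(original_correct, attacked_correct) if o]
    if not kept:
        raise ValueError("the model solved no original problems")
    return 100.0 * sum(kept) / len(kept)


def conditional_drop(
    original_correct: Sequence[bool], attacked_correct: Sequence[bool]
) -> float:
    """Percentage-point drop relative to the 100% baseline on the kept subset."""
    return 100.0 - conditional_attacked_accuracy(original_correct, attacked_correct)


# Made-up outcomes for five problems, for illustration only.
orig_ok = [True, True, True, True, False]
attk_ok = [True, True, True, False, True]
print(conditional_attacked_accuracy(orig_ok, attk_ok))  # 75.0
print(conditional_drop(orig_ok, attk_ok))               # 25.0
```

Conditioning on the originally solved problems separates failures caused by the numeric change from problems the model could not solve in the first place.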
Topics
- Large Language Models
- Arithmetic Reasoning
- LLM Robustness
- Generalization Testing
- Numeric-Remapping Attacks
- LLM Evaluation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.