Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new study introduces an automatic algorithm that generates numeric-remapping attacks to test how well the arithmetic reasoning of large language models (LLMs) generalizes. Unlike template-based approaches, the method derives a problem-specific symbolic representation, generates constrained numeric remappings, recomputes gold answers, and transforms questions via LLM-guided deterministic edits. The pipeline, validated stage by stage, measures LLM fragility under small, schema-preserving numeric changes that retain the original reasoning program. Evaluating DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith, the study found conditional accuracy drops of 12.16 to 25.82 percentage points on GSM8K. By contrast, MAWPS and MultiArith were highly stable, with attacked accuracies near or above 98%. These results indicate that numeric-remapping robustness depends strongly on dataset structure, and that GSM8K remains sensitive even when reasoning programs are preserved.

Key takeaway

Machine learning engineers evaluating LLM robustness in arithmetic reasoning should not rely solely on standard benchmarks, which may mask significant fragility. Models that perform well on the original problems can still fail on structurally similar variants with small numeric changes. You should integrate automatic numeric-remapping attacks into your evaluation pipelines, especially for models deployed in natural language reasoning contexts, to uncover dataset-dependent vulnerabilities and build better-founded trust in arithmetic capabilities.

Key insights

LLM arithmetic reasoning can be fragile under small, schema-preserving numeric changes: conditional accuracy dropped by 12.16 to 25.82 percentage points on GSM8K, while MAWPS and MultiArith stayed near or above 98% attacked accuracy.

Principles

LLM numeric-remapping robustness depends strongly on dataset structure.
LLMs can be sensitive to numeric variation even when the reasoning program is preserved.

Method

The automatic algorithm derives a problem-specific symbolic representation, generates constrained numeric remappings, recomputes the gold answer for each remapping, and applies LLM-generated edit plans as deterministic edits to the question text.
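The remap-and-recompute steps can be illustrated with a minimal Python sketch. It is not the study's implementation: the `SymbolicProblem` container, the `remap` and `render` helpers, the example problem, and the perturbation range are all hypothetical. The symbolic form is written by hand here, whereas the study derives it automatically, and a simple string template stands in for the LLM-guided edit plan.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SymbolicProblem:
    """Hypothetical container for one problem's symbolic form."""
    template: str                    # question text with {name} slots
    values: Dict[str, int]           # original numbers, keyed by slot name
    program: Callable[..., int]      # reasoning program over the slots
    constraint: Callable[..., bool]  # keeps a remapped problem well-posed


def remap(problem: SymbolicProblem, rng: random.Random,
          max_delta: int = 3, max_tries: int = 1000) -> Optional[Dict[str, int]]:
    """Sample a small numeric perturbation that satisfies the constraint."""
    for _ in range(max_tries):
        candidate = {name: value + rng.randint(-max_delta, max_delta)
                     for name, value in problem.values.items()}
        if candidate == problem.values:
            continue  # the attack must change at least one number
        if min(candidate.values()) <= 0:
            continue  # assumption: all quantities stay positive
        if problem.constraint(**candidate):
            return candidate
    return None  # no valid remapping found within the budget


def render(problem: SymbolicProblem, values: Dict[str, int]) -> Tuple[str, int]:
    """Substitute the numbers and recompute the gold answer with the same program."""
    return problem.template.format(**values), problem.program(**values)


# Hand-written example problem (an assumption, not taken from any dataset).
problem = SymbolicProblem(
    template=("Tom has {a} apples and gives away {b}. He splits the rest "
              "equally among {c} friends. How many apples does each friend get?"),
    values={"a": 14, "b": 2, "c": 4},
    program=lambda a, b, c: (a - b) // c,
    constraint=lambda a, b, c: c >= 2 and a > b and (a - b) % c == 0,
)

new_values = remap(problem, random.Random(0))
if new_values is not None:
    question, gold = render(problem, new_values)
    print(question, "->", gold)
```

The constraint is what keeps the change schema-preserving in this sketch: the same program `(a - b) // c` still applies, and the answer remains a positive whole number.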

In practice

Use numeric-remapping attacks to stress-test LLM arithmetic.
Identify dataset structures that expose LLM reasoning fragility.
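When adding such attacks to an evaluation pipeline, one way to report the effect is a conditional accuracy drop. The sketch below assumes a definition the summary does not spell out: accuracy on attacked variants is measured only over problems the model answered correctly in their original form, and the drop is the gap to 100% in percentage points. The function name and inputs are hypothetical.

```python
from typing import Sequence


def conditional_accuracy_drop(original_correct: Sequence[bool],
                              attacked_correct: Sequence[bool]) -> float:
    """Percentage-point drop on attacked variants, given the original was solved.

    Assumed definition: restrict to problems solved in original form (where
    accuracy is 100% by construction) and report 100 minus attacked accuracy.
    """
    if len(original_correct) != len(attacked_correct):
        raise ValueError("both result lists must cover the same problems")
    # Keep attacked outcomes only for problems the model originally solved.
    kept = [att for orig, att in zip(original_correct, attacked_correct) if orig]
    if not kept:
        raise ValueError("no originally correct problems to condition on")
    attacked_accuracy = sum(kept) / len(kept)
    return 100.0 * (1.0 - attacked_accuracy)


# Toy per-problem results (made up for illustration only).
original = [True, True, True, True, False]
attacked = [True, False, True, True, True]
print(conditional_accuracy_drop(original, attacked))
```

Conditioning on originally solved problems separates fragility under remapping from problems the model could not solve in the first place.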

Topics

Large Language Models
Arithmetic Reasoning
LLM Robustness
Generalization Testing
Numeric-Remapping Attacks
LLM Evaluation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.