ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Summary
ASyMOB is an assessment framework of 17,092 unique math challenges that evaluates large language models' (LLMs) proficiency in university-level symbolic mathematics, including integration, limits, and differential equations. The benchmark shows that LLM performance degrades substantially on perturbed problems (up to -70.3%), suggesting reliance on memorized patterns. The frontier models o4-mini and Gemini 2.5 Flash, which score 96.8% and 97.6% on the unperturbed set, are markedly more robust (-21.7% and -21.2%, against an average of -50.4% for the other models). The study also identifies instances where computer algebra systems (CAS) fail while LLMs succeed, and shows that integrated code execution improves weaker models by up to +33.1%. Hybrid LLM+CAS strategies can solve problems that neither system can address alone.
Key takeaway
For AI Scientists and Machine Learning Engineers developing or deploying LLMs for scientific applications, this research underscores the need to rigorously test models against perturbed symbolic math problems. While frontier models like o4-mini and Gemini 2.5 Flash show promising generalization, integrating code execution or adopting hybrid LLM+CAS strategies can significantly enhance reliability for other models, especially when tackling complex university-level symbolic tasks. Focus on robustness to perturbations, not just baseline accuracy, to ensure genuine mathematical understanding.
Key insights
LLMs struggle with symbolic math perturbations, indicating pattern memorization, though frontier models show surprising robustness.
Principles
- LLM symbolic math performance degrades significantly with perturbations.
- Frontier LLMs exhibit superior robustness to symbolic perturbations.
- Hybrid LLM+CAS strategies can overcome individual system limitations.
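The degradation described in the first principle can be measured by scoring a model separately on seed problems and on each perturbation type. The sketch below is illustrative only: the `Result` record, the variant labels, and the choice to report the drop as a difference in accuracy (in percentage points) are assumptions, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class Result:
    variant: str   # "seed" for an unperturbed problem, else a perturbation label
    correct: bool  # whether the model's final answer was judged correct


def accuracy(results):
    """Percentage of correct answers in a non-empty list of results."""
    return 100.0 * sum(r.correct for r in results) / len(results)


def degradation_by_variant(results, baseline="seed"):
    """Accuracy change of each perturbation type relative to the baseline set."""
    groups = {}
    for r in results:
        groups.setdefault(r.variant, []).append(r)
    base = accuracy(groups[baseline])
    return {
        variant: accuracy(rs) - base
        for variant, rs in groups.items()
        if variant != baseline
    }
```

A negative value means the model does worse on that perturbation type than on the seed problems, which is the pattern the summary reports for most models.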
Method
ASyMOB generates 17,092 unique challenges by systematically perturbing seed university-level symbolic math problems with numerical, symbolic, and equivalence variations, then validates answers using dual symbolic and numerical checks.
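A minimal sketch of this idea, assuming SymPy as the CAS and a made-up seed integral. The specific perturbations, the helper names (`perturb`, `symbolic_check`, `numeric_check`), the sampling range, and the tolerance are all illustrative assumptions rather than the benchmark's own implementation. Antiderivatives are checked by differentiating the candidate, which sidesteps the constant of integration.

```python
import random

import sympy as sp

x, A = sp.symbols("x A", positive=True)

# Hypothetical seed problem: integrate x*exp(x) with respect to x.
seed_integrand = x * sp.exp(x)


def perturb(integrand):
    """Return one illustrative variant per perturbation family."""
    return {
        "numerical": 3 * integrand,  # scale by a numeric constant
        "symbolic": A * integrand,   # scale by a symbolic constant
        # multiply by an expression that is identically equal to 1
        "equivalence": integrand * (sp.sin(x) ** 2 + sp.cos(x) ** 2),
    }


def symbolic_check(candidate, integrand):
    """Symbolic check: d/dx(candidate) - integrand simplifies to zero."""
    return sp.simplify(sp.diff(candidate, x) - integrand) == 0


def numeric_check(candidate, integrand, trials=5, tol=1e-9, seed=0):
    """Numerical check: the same residual vanishes at random sample points."""
    rng = random.Random(seed)
    residual = sp.diff(candidate, x) - integrand
    for _ in range(trials):
        point = {s: rng.uniform(0.5, 2.0) for s in residual.free_symbols}
        if abs(complex(sp.N(residual.subs(point)))) > tol:
            return False
    return True
```

Running both checks is useful because each covers a weakness of the other: simplification can fail to reduce a correct but unusually written answer to zero, while random sampling can only give evidence of equality, not a proof.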
In practice
- Test LLMs with perturbed inputs to assess generalization.
- Integrate code execution for weaker LLMs to boost math performance.
- Combine LLM strategic ability with CAS rigor for complex problems.
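The last recommendation can be sketched as a propose-and-verify loop: the model suggests candidate answers and the CAS accepts only those it can verify. This is one possible hybrid arrangement under stated assumptions, not the strategy used in the study. SymPy stands in for the CAS, and `propose_candidates` and `fake_llm` are hypothetical stand-ins for a real model call.

```python
import sympy as sp

x = sp.symbols("x")


def verifies(candidate, integrand):
    """CAS-side check: the candidate's derivative must equal the integrand."""
    return sp.simplify(sp.diff(candidate, x) - integrand) == 0


def hybrid_integrate(integrand, propose_candidates):
    """Try model-proposed antiderivatives first, keeping only verified ones."""
    for text in propose_candidates(integrand):
        try:
            candidate = sp.sympify(text)
        except (sp.SympifyError, SyntaxError):
            continue  # skip proposals that do not parse
        if verifies(candidate, integrand):
            return candidate
    # No verified proposal: fall back to the CAS alone.
    return sp.integrate(integrand, x)


def fake_llm(integrand):
    # Hypothetical stand-in for a model call: one wrong and one right proposal.
    return ["x*exp(x)", "(x - 1)*exp(x)"]


answer = hybrid_integrate(x * sp.exp(x), fake_llm)
```

Because every accepted answer passes the CAS check, a wrong proposal costs only time; the model contributes candidate strategies and the CAS supplies the rigor.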
Topics
- Large Language Models
- Symbolic Mathematics
- LLM Benchmarking
- Mathematical Reasoning
- Tool Use
- Computer Algebra Systems
- Generalization Capabilities
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.