ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

ASyMOB is an assessment framework of 17,092 unique math challenges for evaluating how well large language models (LLMs) handle university-level symbolic mathematics, including integration, limits, and differential equations. The benchmark shows that LLMs, including some with high baseline accuracy, degrade substantially on perturbed problems (up to -70.3%), which suggests reliance on memorized patterns. The frontier models o4-mini and Gemini 2.5 Flash are the exception: they score 96.8% and 97.6% on the unperturbed set and are far more robust to perturbations (-21.7% and -21.2% respectively, versus an average of -50.4% for the other models). The study also identifies instances where computer algebra systems (CAS) fail while LLMs succeed, and shows that integrated code execution improves weaker models by up to +33.1%. Hybrid LLM+CAS strategies can solve problems that neither system can address alone.

Key takeaway

For AI Scientists and Machine Learning Engineers developing or deploying LLMs for scientific applications, this research underscores the need to rigorously test models against perturbed symbolic math problems. While frontier models like o4-mini and Gemini 2.5 Flash show promising generalization, integrating code execution or adopting hybrid LLM+CAS strategies can significantly enhance reliability for other models, especially when tackling complex university-level symbolic tasks. Focus on robustness to perturbations, not just baseline accuracy, to ensure genuine mathematical understanding.

Key insights

LLMs struggle with symbolic math perturbations, indicating pattern memorization, though frontier models show surprising robustness.

Method

ASyMOB generates 17,092 unique challenges by systematically perturbing seed university-level symbolic math problems with numerical, symbolic, and equivalence variations, then validates answers using dual symbolic and numerical checks.
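The numerical half of the dual check described above can be illustrated with a small sketch. It is not the benchmark's own code: the seed problem, the `make_variant` perturbation, the sampling interval, and the tolerances are all hypothetical choices made for illustration, and the symbolic half (which would normally rely on a CAS) is left out so the example stays within the Python standard library.

```python
import math
import random


def numerically_equivalent(f, g, n_points=20, lo=0.5, hi=2.0, seed=0):
    """Return True if f and g agree at n_points random points in [lo, hi].

    Tolerances and the sampling interval are illustrative assumptions.
    """
    rng = random.Random(seed)
    for _ in range(n_points):
        x = rng.uniform(lo, hi)
        if not math.isclose(f(x), g(x), rel_tol=1e-9, abs_tol=1e-12):
            return False
    return True


def make_variant(a):
    """Hypothetical numerical perturbation of a seed problem.

    Seed:    d/dx [sin(x)^2]     = 2 sin(x) cos(x)
    Variant: d/dx [a * sin(x)^2] = 2 a sin(x) cos(x)
    Returns the reference answer for the perturbed problem as a callable.
    """
    return lambda x: 2 * a * math.sin(x) * math.cos(x)


reference = make_variant(3)

# A correct answer written in a different but equivalent form.
equivalent_answer = lambda x: 3 * math.sin(2 * x)

# An answer that reproduces the seed solution and ignores the perturbation.
memorized_answer = lambda x: 2 * math.sin(x) * math.cos(x)

print(numerically_equivalent(reference, equivalent_answer))
print(numerically_equivalent(reference, memorized_answer))
```

Sampling at random points accepts answers that differ from the reference only in form, while an answer that repeats the unperturbed seed solution is rejected. A symbolic check would complement this by catching cases where sampled points happen to agree or where the expression is undefined on the chosen interval.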

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.