PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
Summary
PyraMathBench is a new hierarchical benchmark designed to evaluate and improve large language models' (LLMs) mathematical capabilities, specifically addressing numerical reasoning. Comprising 32,505 questions derived from 7,404 math word problems, it covers 4 cognitive aspects, 14 subcategories, and 2 modalities. Experiments using PyraMathBench reveal that LLMs perform poorly due to inadequate numerical computation and weak handling of abstract numerical questions. To mitigate these issues, the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO) are proposed. These modules enhance LLMs' numerical-mathematical synergy via efficient tool calls, including fuzzy matching and low-quality call rejection. Training with SOLVE and IRPO improved Qwen-2.5's score by 5.0 points.
Key takeaway
Machine learning engineers evaluating or fine-tuning LLMs for mathematical tasks should treat numerical computation and the handling of abstract numerical questions as critical failure points. Benchmarks like PyraMathBench can pinpoint these specific weaknesses in a model. Approaches that strengthen numerical-mathematical synergy through efficient tool calls, such as SOLVE and IRPO, can improve performance: training with them raised Qwen-2.5's score by 5.0 points.
Key insights
LLMs' mathematical failures stem from poor numerical computation and weak handling of abstract numerical questions, which calls for integrated benchmarks and specialized modules.
Principles
- LLM math benchmarks must integrate numerical processing and reasoning.
- Inadequate numerical computation severely compromises LLM math performance.
- Abstract numerical questions are a significant weakness for LLMs.
Method
SOLVE and IRPO enhance LLMs' numerical-mathematical synergy through efficient tool calls, including fuzzy matching and rejection of low-quality calls.
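To make the two ideas named above concrete, here is a minimal Python sketch of fuzzy matching and low-quality call rejection for tool calls. It is not the SOLVE or IRPO implementation, which this summary does not describe in detail. The tool registry, the function names `resolve_tool` and `run_call`, and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib
import math
import operator

# Hypothetical tool registry: these names are illustrative, not from the paper.
TOOLS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def resolve_tool(name, cutoff=0.8):
    """Fuzzy-match a possibly misspelled tool name against the registry.

    Returns the closest registered name, or None if nothing is similar
    enough. The cutoff value is an assumed threshold.
    """
    matches = difflib.get_close_matches(name.lower(), list(TOOLS), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def _is_number(value):
    """Accept ints and finite floats; reject bools, strings, NaN, and inf."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, float) and math.isfinite(value)


def run_call(name, args):
    """Execute a two-argument tool call, or return None to reject it."""
    tool = resolve_tool(name)
    if tool is None:
        return None  # rejected: no registered tool is close enough
    if len(args) != 2 or not all(_is_number(a) for a in args):
        return None  # rejected: malformed arguments
    try:
        return TOOLS[tool](*args)
    except ZeroDivisionError:
        return None  # rejected: the call cannot produce a valid result
```

In this sketch, a call such as `run_call("multipy", [6, 7])` is repaired to `multiply` and executed, while `run_call("sqrt", [4, 2])` is rejected because no registered name is close enough. A real system would likely log rejections so the model can be steered away from them during training.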
In practice
- Use PyraMathBench to diagnose specific numerical and abstract math weaknesses in LLMs.
- Consider integrating tool-calling modules like SOLVE/IRPO for math-focused LLMs.
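The diagnostic step in the first bullet can be sketched as a per-category accuracy breakdown. This is a minimal Python sketch under assumptions: the record fields (`subcategory`, `correct`) and the category labels in the usage note are hypothetical placeholders, not PyraMathBench's actual schema or taxonomy.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute accuracy per group, where `key` names the grouping field.

    Each record is assumed to be a dict with a boolean "correct" field
    and a grouping field such as "subcategory" (hypothetical schema).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def weakest_groups(records, key, n=3):
    """Return the n groups with the lowest accuracy, worst first."""
    accuracy = accuracy_by_group(records, key)
    return sorted(accuracy.items(), key=lambda item: item[1])[:n]
```

Given graded results such as `[{"subcategory": "computation", "correct": False}, ...]`, calling `weakest_groups(results, "subcategory")` lists the categories where the model fails most often, which is where tool-calling support is most likely to pay off.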
Topics
- PyraMathBench
- Large Language Models
- Mathematical Reasoning
- Numerical Computation
- LLM Evaluation
- Tool Use
- Qwen-2.5
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.