PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

PyraMathBench is a new hierarchical benchmark designed to evaluate and improve large language models' (LLMs) mathematical capabilities, specifically addressing numerical reasoning. Comprising 32,505 questions derived from 7,404 math word problems, it covers 4 cognitive aspects, 14 subcategories, and 2 modalities. Experiments using PyraMathBench reveal that LLMs perform poorly due to inadequate numerical computation and weak handling of abstract numerical questions. To mitigate these issues, the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO) are proposed. These modules enhance LLMs' numerical-mathematical synergy via efficient tool calls, including fuzzy matching and low-quality call rejection. Training with SOLVE and IRPO improved Qwen-2.5's score by 5.0 points.

Key takeaway

Machine learning engineers evaluating or fine-tuning LLMs for mathematical tasks should treat numerical computation and the handling of abstract numerical questions as critical failure points. Benchmarks such as PyraMathBench can pinpoint these weaknesses in a given model. Modules that strengthen numerical-mathematical synergy through efficient tool calls, such as SOLVE and IRPO, can then improve performance: training with them raised Qwen-2.5's score by 5.0 points.
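
One way to pinpoint weaknesses with a hierarchical benchmark is to break accuracy down by category rather than report a single score. The sketch below is a minimal illustration, not PyraMathBench's actual evaluation code: the record layout, the field names (`aspect`, `correct`), and the labels `aspect_A` and `aspect_B` are hypothetical placeholders, since this summary does not name the benchmark's four cognitive aspects.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Return {group: accuracy} for graded records grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["correct"])
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical graded model outputs; the aspect labels are placeholders,
# not the benchmark's real category names.
records = [
    {"aspect": "aspect_A", "correct": True},
    {"aspect": "aspect_A", "correct": False},
    {"aspect": "aspect_B", "correct": False},
    {"aspect": "aspect_B", "correct": True},
    {"aspect": "aspect_B", "correct": False},
]

by_aspect = accuracy_by_group(records, "aspect")
# The lowest-scoring group is the first candidate for targeted fine-tuning.
weakest = min(by_aspect, key=by_aspect.get)
```

The same function can be called with a subcategory or modality field as `key`, assuming the graded records carry those fields.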

Key insights

LLMs' mathematical failures stem from poor numerical computation and abstract numerical handling, necessitating integrated benchmarks and specialized modules.

Method

The SOLVE and IRPO modules enhance LLMs' numerical-mathematical synergy by employing efficient tool calls, specifically fuzzy matching and low-quality call rejection.
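
To make the two mechanisms concrete, here is a minimal sketch of a tool dispatcher that fuzzy-matches a model-emitted tool name and rejects low-quality calls. It is an illustration under stated assumptions, not the SOLVE implementation: the tool registry `TOOLS`, the function `dispatch`, and the 0.75 similarity cutoff are all hypothetical, and the matching uses Python's standard `difflib` rather than whatever method the authors use.

```python
import difflib
import math

# Hypothetical tool registry; the real SOLVE tool set is not described in this summary.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "sqrt": lambda a: math.sqrt(a),
}

def dispatch(name, args, cutoff=0.75):
    """Resolve a possibly misspelled tool name and run it.

    Returns (resolved_name, result), or None when the call is rejected.
    The cutoff value is an assumption chosen for illustration.
    """
    # Fuzzy matching: accept the closest registered name above the cutoff.
    matches = difflib.get_close_matches(name.lower(), list(TOOLS), n=1, cutoff=cutoff)
    if not matches:
        return None  # rejected: no registered tool is close enough
    try:
        numbers = [float(arg) for arg in args]
        result = TOOLS[matches[0]](*numbers)
    except (TypeError, ValueError):
        return None  # rejected: malformed, missing, or out-of-domain arguments
    return matches[0], result

ok = dispatch("multipy", ["3", "4"])        # misspelled name, still resolved
unknown = dispatch("integrate", ["1"])      # no close match, rejected
bad_args = dispatch("add", ["2", "x"])      # non-numeric argument, rejected
```

Returning `None` instead of raising lets the caller decide what to do with a rejected call, for example asking the model to retry or falling back to its own computation.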

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.