GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

GTBench is a curriculum-grounded benchmark that evaluates large language models (LLMs) as mathematical research assistants in graph theory. Its 63 problems are drawn from verified academic materials such as Diestel's Graph Theory and are organized into three groups of increasing difficulty: undergraduate definitions, algorithm tracing, and graduate-level proof construction. Five frontier models (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3) were tested with zero-shot and chain-of-thought (CoT) prompting, and scored by exact match, LLM-as-judge, and a hybrid human expert protocol for proofs. GPT-5 approaches the ceiling on Group 1 (95.8% zero-shot) and maintains 82% accuracy on graduate proofs, while the other models degrade significantly; Llama 3.3 70B scores 0% on Group 3 zero-shot. Failure analysis highlights "correct algorithm, wrong execution" errors and "incomplete reasoning" in proofs, along with systematic disagreement between human and automated judges (kappa between 0.48 and 0.83).
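The judge-agreement figure above can be made concrete with a small calculation. The sketch below assumes the reported kappa is Cohen's kappa for two raters, which this summary does not state; the function name `cohens_kappa` and the pass/fail verdicts are hypothetical and are not GTBench data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical verdicts on ten proofs (illustrative only, not GTBench data).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail", "pass"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)
```

In this made-up case the raters agree on 8 of 10 proofs, chance agreement is 0.5, and kappa comes out at about 0.6. The two disagreements are both cases where the automated judge passes a proof the human fails, which is the kind of one-directional pattern a "systematic" disagreement would show.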

Key takeaway

For AI scientists evaluating LLMs for mathematical reasoning, or building AI tools for education and research, this benchmark shows that current models other than GPT-5 are largely unreliable on complex tasks such as graduate-level proof construction. Expect performance to degrade significantly as problem difficulty increases, and keep human oversight in place when validating LLM outputs in these high-stakes domains, especially given the observed discrepancies between automated and human evaluations.

Key insights

LLMs struggle with mathematical reasoning as difficulty rises, particularly on graduate-level proofs: GPT-5 maintains 82% accuracy there, while Llama 3.3 70B scores 0% on Group 3 zero-shot.

Method

GTBench evaluates LLMs using 63 graph theory problems across three difficulty groups, employing zero-shot/CoT prompting and a hybrid human/LLM-as-judge evaluation protocol.
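As a rough illustration of how the exact-match part of such a protocol could be tallied, the sketch below computes accuracy per difficulty group and prompting mode. The record fields (`group`, `prompt_mode`, `prediction`, `reference`), the normalization rule, and the sample records are all assumptions made for this example; the summary does not describe GTBench's actual data format or matching rules, and proofs are scored by judges rather than by exact match.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy_by_cell(records):
    """Return {(group, prompt_mode): exact-match accuracy} for a list of records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        cell = (r["group"], r["prompt_mode"])
        totals[cell] += 1
        if normalize(r["prediction"]) == normalize(r["reference"]):
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}


# Hypothetical records (illustrative only, not GTBench data).
records = [
    {"group": 1, "prompt_mode": "zero-shot",
     "prediction": "A tree on n vertices has n-1 edges",
     "reference": "a tree on n vertices has  n-1 edges"},
    {"group": 1, "prompt_mode": "zero-shot", "prediction": "3", "reference": "4"},
    {"group": 2, "prompt_mode": "cot", "prediction": "A B D C", "reference": "a b d c"},
]
scores = accuracy_by_cell(records)
```

Keying the results by (group, prompting mode) keeps cells such as "Group 3, zero-shot" separate, which is the granularity at which the figures in this summary are reported.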

Best for: AI Scientist, Research Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.