Benchmarking Large Language Models on Floating-Point Error Classification

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A new benchmark, InterFLOPBench, evaluates the ability of Large Language Models (LLMs) to statically detect and classify floating-point errors in C code. Comprising 90 C kernels and 1,130 test samples, InterFLOPBench covers six error categories: cancellation, comparison, division by zero, overflow, underflow, and NaN. The evaluation treats error detection as a multi-label classification problem scored with the F1 score. Recent models, including Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and gpt-oss 20b and 120b, performed strongly, achieving an overall F1 score greater than 0.88. However, performance varied significantly by error type: explicit operations such as division by zero averaged an F1 score of 0.8479, while subtle numerical phenomena such as underflow and cancellation scored lower, at 0.6059 and 0.6164 respectively.
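To make the six categories concrete, here is a minimal C sketch with one small function per kind of error. These functions and their names are illustrative assumptions written for this summary, not kernels taken from InterFLOPBench. The behavior described in the comments assumes IEEE 754 double arithmetic and no aggressive flags such as -ffast-math.

```c
#include <float.h>
#include <math.h>

/* Cancellation: a small x is absorbed when added to 1.0, so the later
   subtraction returns 0.0 instead of x once x is far below DBL_EPSILON. */
double absorb_then_subtract(double x) {
    volatile double sum = 1.0 + x;   /* volatile forces rounding to double */
    return sum - 1.0;
}

/* Comparison: exact equality on rounded sums is fragile. With a = 0.1,
   b = 0.2, expected = 0.3 this returns 0 because a + b rounds differently. */
int sum_equals(double a, double b, double expected) {
    volatile double sum = a + b;
    return sum == expected;
}

/* Division by zero (nonzero / 0.0 gives +/-inf) and NaN (0.0 / 0.0 gives
   NaN) both come from the same innocent-looking expression. */
double divide(double num, double den) {
    return num / den;
}

/* Overflow and underflow: the product leaves the representable range and
   becomes inf (too large) or 0.0 (too small) without any visible error. */
double multiply(double a, double b) {
    return a * b;
}
```

The division cases are visible in the source text as a `/` with a possibly zero denominator. The cancellation and underflow cases depend on the magnitudes of the operands, which may be part of why they are harder to spot statically: absorb_then_subtract(1e-17) returns 0.0, and multiply(DBL_MIN, DBL_MIN) returns 0.0, although neither line looks suspicious.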

Key takeaway

Software engineers integrating LLMs into static analysis workflows can treat these models as effective detectors of explicit floating-point errors such as division by zero. LLMs currently struggle more with subtle numerical issues such as underflow and cancellation, so plan additional specialized checks or human review for those categories. Deploy LLMs first on the error types where they score highest.

Key insights

LLMs can effectively classify floating-point errors in C code, though performance varies by error type.

Principles

LLMs show strong capability in static code analysis for numerical errors.
Error classification performance differs significantly across floating-point error categories.
Benchmarking LLMs for specific code analysis tasks requires tailored datasets.

Method

InterFLOPBench frames floating-point error detection as a multi-label classification problem and scores LLMs by F1 score across 90 C kernels and 1,130 test samples.
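As a rough illustration of multi-label F1 scoring, the C sketch below stores each sample's error labels as a bitmask and computes a per-label F1 and a micro-averaged F1. The bitmask encoding, the function names, and the choice of micro averaging are assumptions made for this example. The summary does not say how InterFLOPBench aggregates its scores.

```c
#include <stddef.h>

/* Hypothetical encoding: one bit per error category. */
enum {
    FP_CANCELLATION = 1 << 0,
    FP_COMPARISON   = 1 << 1,
    FP_DIV_BY_ZERO  = 1 << 2,
    FP_OVERFLOW     = 1 << 3,
    FP_UNDERFLOW    = 1 << 4,
    FP_NAN          = 1 << 5,
    FP_NUM_LABELS   = 6
};

/* F1 = 2*TP / (2*TP + FP + FN); defined as 0 here when nothing was counted. */
static double f1_from_counts(unsigned tp, unsigned fp, unsigned fn) {
    unsigned denom = 2u * tp + fp + fn;
    return denom == 0 ? 0.0 : (2.0 * tp) / denom;
}

/* F1 for a single label over n samples. */
double label_f1(const unsigned *truth, const unsigned *pred, size_t n,
                unsigned label) {
    unsigned tp = 0, fp = 0, fn = 0;
    for (size_t i = 0; i < n; i++) {
        int t = (truth[i] & label) != 0;   /* label is really present */
        int p = (pred[i] & label) != 0;    /* label was predicted */
        tp += (t && p);
        fp += (!t && p);
        fn += (t && !p);
    }
    return f1_from_counts(tp, fp, fn);
}

/* Micro-averaged F1: pool the counts of all labels before computing F1. */
double micro_f1(const unsigned *truth, const unsigned *pred, size_t n) {
    unsigned tp = 0, fp = 0, fn = 0;
    for (size_t i = 0; i < n; i++) {
        for (int bit = 0; bit < FP_NUM_LABELS; bit++) {
            unsigned label = 1u << bit;
            int t = (truth[i] & label) != 0;
            int p = (pred[i] & label) != 0;
            tp += (t && p);
            fp += (!t && p);
            fn += (t && !p);
        }
    }
    return f1_from_counts(tp, fp, fn);
}
```

Because a sample can carry several labels at once, a prediction that finds the cancellation in a kernel but misses its underflow is partly right: it adds one true positive and one false negative rather than counting as a single wrong answer.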

In practice

Use LLMs for initial static analysis of C code for numerical issues.
Prioritize human review for subtle errors like underflow and cancellation.
Integrate LLM-based tools into CI/CD pipelines for early detection.

Topics

Large Language Models
Floating-Point Errors
Static Code Analysis
InterFLOPBench
Software Benchmarking
Numerical Stability

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.