Benchmarking Large Language Models on Floating-Point Error Classification
Summary
A new benchmark, InterFLOPBench, evaluates the ability of Large Language Models (LLMs) to statically detect and classify floating-point errors in C code. It comprises 90 C kernels and 1,130 test samples and covers six error categories: cancellation, comparison, division by zero, overflow, underflow, and NaN. The evaluation framework treats error detection as a multi-label classification problem scored with the F1 metric. Recent models, including Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and gpt-oss 20b and 120b, performed strongly, achieving an overall F1-score greater than 0.88. Performance varied significantly by error type, however: explicit operations such as division by zero averaged an F1-score of 0.8479, while subtle numerical phenomena scored lower, with underflow at 0.6059 and cancellation at 0.6164.
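To make the six categories concrete, here is a minimal Python sketch. It is not taken from the benchmark, whose kernels are written in C. It assumes CPython floats, which are IEEE 754 doubles on mainstream platforms, so the arithmetic mirrors C `double`; the one difference is division by zero, which Python raises as an exception where IEEE 754 arithmetic in C yields infinity or NaN. All variable names are illustrative.

```python
import math

# Cancellation: the small term is lost when it is added to a much larger
# value, so the subtraction returns 0.0 instead of the exact answer 1.0.
big = 1e16
cancellation = (big + 1.0) - big

# Comparison: exact equality on rounded values is unreliable.
comparison_equal = (0.1 + 0.2 == 0.3)

# Overflow: the product exceeds the largest finite double and becomes inf.
overflow = 1e308 * 10.0

# Underflow: the product is smaller than the smallest subnormal double
# and is flushed to 0.0.
underflow = 1e-200 * 1e-200

# NaN: inf - inf has no defined value.
nan_value = math.inf - math.inf

# Division by zero: Python raises instead of returning inf as C would
# under IEEE 754 arithmetic.
try:
    1.0 / 0.0
    division_raised = False
except ZeroDivisionError:
    division_raised = True
```

The explicit case (division by zero) is visible in a single operation, while cancellation and underflow depend on the magnitudes of the operands, which is consistent with the lower scores reported for those categories.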
Key takeaway
Software engineers integrating LLMs into static analysis workflows can treat these models as effective detectors of explicit floating-point errors such as division by zero. LLMs currently struggle more with subtle numerical issues such as underflow and cancellation, so these categories call for additional, specialized checks or human review. Deploy LLMs first for the error types they detect with high confidence.
Key insights
LLMs can effectively classify floating-point errors in C code, though performance varies by error type.
Principles
- LLMs show strong capability in static code analysis for numerical errors.
- Error classification performance differs significantly across floating-point error categories.
- Benchmarking LLMs for specific code analysis tasks requires tailored datasets.
Method
InterFLOPBench evaluates LLMs on floating-point error detection as a multi-label classification problem using F1-score across 90 C kernels and 1,130 test samples.
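The scoring idea can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the summary does not say which F1 averaging InterFLOPBench uses, so the sketch assumes per-label F1 plus a micro-averaged overall score, and it assumes a label with no true or predicted occurrences scores 0.0. The label strings and function names are hypothetical.

```python
# Hypothetical label names for the six categories named in the summary.
LABELS = ["cancellation", "comparison", "division_by_zero",
          "overflow", "underflow", "nan"]


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when the label never occurs (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def score(expected, predicted):
    """Return (per-label F1 dict, micro-averaged F1) for lists of label sets."""
    counts = {label: [0, 0, 0] for label in LABELS}  # [tp, fp, fn]
    for truth, guess in zip(expected, predicted):
        for label in LABELS:
            if label in truth and label in guess:
                counts[label][0] += 1   # true positive
            elif label in guess:
                counts[label][1] += 1   # false positive
            elif label in truth:
                counts[label][2] += 1   # false negative
    per_label = {label: f1(*c) for label, c in counts.items()}
    # Micro-averaging pools the counts of all labels before computing F1.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return per_label, f1(*totals)


# Each sample may carry zero, one, or several error labels.
expected = [{"division_by_zero"}, {"cancellation", "underflow"}, set()]
predicted = [{"division_by_zero"}, {"cancellation"}, {"overflow"}]
per_label, micro = score(expected, predicted)
```

In this toy run the model misses one underflow and reports one spurious overflow, so the pooled counts are 2 true positives, 1 false positive, and 1 false negative.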
In practice
- Use LLMs for initial static analysis of C code for numerical issues.
- Prioritize human review for subtle errors like underflow and cancellation.
- Integrate LLM-based tools into CI/CD pipelines for early detection.
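One way to act on the points above is to route LLM findings by category. The sketch below is a hypothetical policy, not something the benchmark prescribes: it sends the two categories the summary identifies as weakest (underflow and cancellation) to a human review queue and reports the rest directly. The queue names, the `triage` function, and the `(location, category)` finding format are all assumptions.

```python
# Categories the summary reports as hardest for LLMs to classify.
SUBTLE = frozenset({"underflow", "cancellation"})


def triage(findings):
    """Split (location, category) findings into two queues (hypothetical policy)."""
    routed = {"auto_report": [], "human_review": []}
    for location, category in findings:
        queue = "human_review" if category in SUBTLE else "auto_report"
        routed[queue].append((location, category))
    return routed


# Example findings as an LLM-based checker might emit them in a CI job.
findings = [
    ("kernel_a.c:12", "division_by_zero"),
    ("kernel_b.c:40", "cancellation"),
    ("kernel_c.c:7", "underflow"),
]
routed = triage(findings)
```

Whether the remaining categories are reliable enough to report without review is a judgment call; the summary gives a per-category score only for division by zero, underflow, and cancellation.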
Topics
- Large Language Models
- Floating-Point Errors
- Static Code Analysis
- InterFLOPBench
- Software Benchmarking
- Numerical Stability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.