ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

2026-04-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ThermoQA is a new, open-ended benchmark released in March 2026, designed to evaluate the thermodynamic reasoning capabilities of Large Language Models (LLMs). It comprises 293 engineering thermodynamics problems across three tiers: property lookups (110 questions), component analysis (101 questions), and full cycle analysis (82 questions). Ground truth for all problems is computed programmatically using CoolProp 7.2.0 and NASA polynomial correlations, covering water, R-134a, and variable-$c_{p}$ air. Six frontier LLMs were evaluated, with Claude Opus 4.6 leading the composite leaderboard at 94.1%, followed by GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%). The benchmark revealed significant cross-tier performance degradation, ranging from 2.8 pp to 32.5 pp, indicating that property memorization does not equate to thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis were identified as key discriminators, showing 40–60 pp performance spreads. Multi-run consistency, measured by the standard deviation $\sigma$, varied from $\pm$0.1% to $\pm$2.5%, highlighting reliability as a distinct evaluation axis. The dataset and code are open-source.
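The two summary statistics mentioned above, multi-run spread and cross-tier degradation, can be sketched in a few lines. This is a minimal illustration and not the benchmark's own code: the function names, the use of the sample standard deviation, and the per-run scores are all assumptions, with the scores chosen only so that the spreads land near the reported range endpoints.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


def cross_tier_degradation(tier1_acc, tier3_acc):
    """Accuracy drop from Tier 1 to Tier 3, in percentage points (pp)."""
    return tier1_acc - tier3_acc


# Hypothetical per-run accuracies (percent) for two made-up models.
runs = {
    "model_a": [94.0, 94.1, 94.2],  # tight spread
    "model_b": [90.0, 92.5, 95.0],  # similar mean, much wider spread
}

for name, scores in runs.items():
    mu, sigma = summarize_runs(scores)
    print(f"{name}: {mu:.1f}% ± {sigma:.1f}%")
```

Two models with similar means can differ by an order of magnitude in spread, which is why the spread is reported alongside the mean rather than folded into it.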

Key takeaway

AI scientists and machine learning engineers developing or deploying LLMs for quantitative engineering tasks should prioritize models whose performance holds up across tiers of increasing complexity, especially on real-fluid and cycle-analysis problems. Evaluations should include multi-run consistency metrics, because high mean accuracy combined with high variance can produce unreliable outputs in practice. Consider augmenting LLMs with external thermodynamic property solvers to mitigate property-retrieval failures and keep the focus on core reasoning capabilities.
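One way to picture the solver-augmentation idea is a thin, validated wrapper that a tool-calling model could invoke instead of recalling property values. The sketch below is an assumption, not something described in the source: `make_property_tool`, `ALLOWED_FLUIDS`, and the validation rule are hypothetical names, and the solver is injected as a callable so the wrapper runs without CoolProp installed. The commented usage relies on CoolProp's `PropsSI` function.

```python
# Hypothetical allow-list; "Water" and "R134a" are CoolProp fluid names.
ALLOWED_FLUIDS = {"Water", "R134a"}


def make_property_tool(props_si):
    """Wrap a PropsSI-style callable as a validated property lookup tool."""

    def lookup(output, name1, value1, name2, value2, fluid):
        # Reject fluids the tool is not meant to serve before calling the solver.
        if fluid not in ALLOWED_FLUIDS:
            raise ValueError(f"unsupported fluid: {fluid}")
        # Model-supplied numbers may arrive as strings; coerce them to floats.
        return props_si(output, name1, float(value1), name2, float(value2), fluid)

    return lookup


# With CoolProp installed, the real solver could be plugged in like this:
#   from CoolProp.CoolProp import PropsSI
#   lookup = make_property_tool(PropsSI)
#   h = lookup("H", "T", 673.15, "P", 5e6, "Water")  # specific enthalpy, J/kg
```

Injecting the solver keeps the wrapper testable with a stub and makes it easy to swap in a different property backend.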

Key insights

Thermodynamic reasoning in LLMs requires more than memorization, with performance degrading significantly on complex, real-fluid problems.

Principles

Property memorization does not imply thermodynamic reasoning.
Errors in thermodynamic calculations cascade significantly.
LLM consistency varies widely across problem complexity.
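
The cascade principle can be illustrated with a simple ideal Rankine cycle, where thermal efficiency depends on differences of state enthalpies. The sketch below is an illustration under stated assumptions rather than a problem from the benchmark: the four enthalpy values are round, hypothetical numbers, and the function names are made up for this example.

```python
def thermal_efficiency(h1, h2, h3, h4):
    """Ideal Rankine cycle efficiency from four state enthalpies (kJ/kg).

    States: 1 pump inlet, 2 pump outlet, 3 turbine inlet, 4 turbine outlet.
    """
    w_turbine = h3 - h4
    w_pump = h2 - h1
    q_in = h3 - h2
    return (w_turbine - w_pump) / q_in


def efficiency_error_from_h4(states, h4_rel_error):
    """Relative error in efficiency caused by a relative error in h4 alone."""
    exact = thermal_efficiency(**states)
    perturbed = dict(states, h4=states["h4"] * (1 + h4_rel_error))
    return thermal_efficiency(**perturbed) / exact - 1


# Hypothetical, round-number enthalpies in kJ/kg (not taken from the benchmark).
states = {"h1": 200.0, "h2": 210.0, "h3": 3400.0, "h4": 2200.0}

err = efficiency_error_from_h4(states, 0.02)
print(f"2% error in h4 -> {abs(err) * 100:.1f}% error in efficiency")
```

With these numbers, a 2% error in a single looked-up enthalpy becomes roughly a 3.7% error in efficiency, because turbine work is a difference between two large values and the absolute error does not shrink with it.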

Method

ThermoQA uses a three-tiered, open-ended problem structure with programmatic ground truth from CoolProp and NASA polynomials, evaluated via multi-run consistency analysis and weighted step-level scoring.
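
Weighted step-level scoring can be sketched as a tolerance check per intermediate quantity, followed by a weighted sum. The source does not give the benchmark's tolerances, weights, or step names, so everything below is assumed for illustration: the 2% relative tolerance, the weights, and the expected values are hypothetical.

```python
def within_tolerance(predicted, expected, rel_tol=0.02):
    """True when predicted lies within rel_tol (relative) of expected."""
    if expected == 0:
        # Fall back to an absolute band when the reference value is zero.
        return abs(predicted) <= rel_tol
    return abs(predicted - expected) <= rel_tol * abs(expected)


def weighted_step_score(predicted, expected, weights, rel_tol=0.02):
    """Weighted fraction of steps within tolerance; missing steps earn nothing."""
    total = sum(weights.values())
    earned = sum(
        weight
        for step, weight in weights.items()
        if step in predicted
        and within_tolerance(predicted[step], expected[step], rel_tol)
    )
    return earned / total


# Hypothetical reference values, model answers, and weights for one problem.
expected = {"h1": 200.0, "h2": 3000.0, "w_net": 1000.0, "eta": 0.35}
predicted = {"h1": 201.0, "h2": 3000.0, "w_net": 1100.0, "eta": 0.351}
weights = {"h1": 1, "h2": 1, "w_net": 2, "eta": 4}

score = weighted_step_score(predicted, expected, weights)
print(f"step-level score: {score:.2f}")
```

Here the model misses only `w_net`, which carries 2 of the 8 weight units, so the score is 0.75. Weighting later steps more heavily is one design choice among several; equal weights would reward correct property lookups more.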

In practice

Focus LLM training on real-fluid properties like R-134a.
Implement self-verification for intermediate calculation steps.
Prioritize models with low multi-run standard deviation for reliability.
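
Self-verification of intermediate steps can be as simple as checking that a reported cycle closes its energy balance and respects the Carnot limit. The sketch below is one possible check and is not described in the source: the function names, the 1% closure tolerance, and the sample numbers are assumptions.

```python
def first_law_residual(q_in, q_out, w_net):
    """Energy balance residual for a closed cycle; zero when it closes."""
    return q_in - q_out - w_net


def verify_cycle(q_in, q_out, w_net, t_hot_k, t_cold_k, rel_tol=0.01):
    """Return a list of detected inconsistencies (empty when none are found)."""
    issues = []
    # First law: net work must equal net heat added over the cycle.
    if abs(first_law_residual(q_in, q_out, w_net)) > rel_tol * abs(q_in):
        issues.append("energy balance does not close")
    # Second law: efficiency must stay below the Carnot limit.
    eta = w_net / q_in
    carnot = 1 - t_cold_k / t_hot_k
    if not 0 < eta < carnot:
        issues.append("efficiency outside (0, Carnot limit)")
    return issues


# Hypothetical reported values in kJ/kg, with reservoir temperatures in kelvin.
print(verify_cycle(q_in=3190.0, q_out=2000.0, w_net=1190.0,
                   t_hot_k=800.0, t_cold_k=300.0))
```

Checks like these catch arithmetic slips and physically impossible answers, but they cannot detect a wrong property value that happens to be self-consistent.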

Topics

ThermoQA Benchmark
Thermodynamic Reasoning
Large Language Models
Real-Fluid Properties
Cycle Analysis

Code references

olivenet-iot/ThermoQA

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.