ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
Summary
ThermoQA is an open-ended benchmark, released in March 2026, that evaluates the thermodynamic reasoning capabilities of large language models (LLMs). It comprises 293 engineering thermodynamics problems across three tiers: property lookups (110 questions), component analysis (101 questions), and full cycle analysis (82 questions). Ground truth for all problems is computed programmatically using CoolProp 7.2.0 and NASA polynomial correlations, covering water, R-134a, and variable-$c_{p}$ air. Six frontier LLMs were evaluated, with Claude Opus 4.6 leading the composite leaderboard at 94.1%, followed by GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%). The benchmark revealed significant cross-tier performance degradation, ranging from 2.8 pp to 32.5 pp, indicating that property memorization does not equate to thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis were identified as key discriminators, showing 40–60 pp performance spreads. Multi-run consistency, measured by the standard deviation $\sigma$, varied from $\pm$0.1% to $\pm$2.5%, highlighting reliability as a distinct evaluation axis. The dataset and code are open-source.
Key takeaway
AI scientists and machine learning engineers developing or deploying LLMs for quantitative engineering tasks should prioritize models that perform consistently as problem complexity increases across tiers, especially on real-fluid and cycle analysis problems. Evaluations should include multi-run consistency metrics, because a high mean accuracy combined with high variance can still produce unreliable outputs in practice. Consider augmenting LLMs with external thermodynamic property solvers, which mitigates property retrieval failures and lets evaluation focus on core reasoning capabilities.
Key insights
Thermodynamic reasoning in LLMs requires more than memorization, with performance degrading significantly on complex, real-fluid problems.
Principles
- Property memorization does not imply thermodynamic reasoning.
- Errors in thermodynamic calculations cascade significantly.
- LLM consistency varies widely across problem complexity.
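The first principle can be made concrete by measuring how far a model's score drops between the simplest and the hardest tier. The sketch below is a minimal illustration, assuming degradation is defined as the tier 1 score minus the tier 3 score in percentage points; that definition, the model names, and all scores are hypothetical placeholders, not values from the benchmark.

```python
from typing import Dict


def cross_tier_degradation(tier_scores: Dict[str, float]) -> float:
    """Drop in percentage points from tier 1 to tier 3 (assumed definition)."""
    return round(tier_scores["tier1"] - tier_scores["tier3"], 1)


# Hypothetical placeholder scores in percent, NOT benchmark results.
scores = {
    "model_a": {"tier1": 96.0, "tier2": 94.5, "tier3": 93.0},
    "model_b": {"tier1": 90.0, "tier2": 75.0, "tier3": 60.0},
}

degradation = {name: cross_tier_degradation(s) for name, s in scores.items()}

# Rank models from most to least stable across tiers.
ranking = sorted(degradation, key=degradation.get)
```

A model with a strong tier 1 score but a large drop, like the hypothetical `model_b`, would be the pattern the principle warns about: good property recall without robust multi-step reasoning.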
Method
ThermoQA uses a three-tiered, open-ended problem structure with programmatic ground truth from CoolProp and NASA polynomials, evaluated via multi-run consistency analysis and weighted step-level scoring.
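As a rough illustration of weighted step-level scoring and multi-run consistency, the sketch below grades each intermediate quantity against ground truth and then summarizes repeated runs by mean and standard deviation. The 2% relative tolerance, the step names, the weights, and the numeric values are all assumptions made for this example; the summary does not state the benchmark's actual tolerance or weighting scheme.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

REL_TOL = 0.02  # assumed relative tolerance, not taken from the benchmark


def step_correct(predicted: float, truth: float, rel_tol: float = REL_TOL) -> bool:
    """True if the predicted value is within rel_tol of the ground truth."""
    if truth == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - truth) <= rel_tol * abs(truth)


def weighted_score(
    predicted: Dict[str, float],
    truth: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Fraction of total step weight earned; missing steps earn nothing."""
    total = sum(weights.values())
    earned = sum(
        w
        for step, w in weights.items()
        if step in predicted and step_correct(predicted[step], truth[step])
    )
    return earned / total


def run_statistics(run_scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs."""
    return mean(run_scores), stdev(run_scores)


# Illustrative values only (kJ/kg); step names and weights are hypothetical.
truth = {"h_in": 3230.0, "h_out": 2675.0, "w_turbine": 555.0}
weights = {"h_in": 1.0, "h_out": 1.0, "w_turbine": 2.0}
predicted = {"h_in": 3231.0, "h_out": 2500.0, "w_turbine": 731.0}

score = weighted_score(predicted, truth, weights)
avg, sigma = run_statistics([0.90, 0.92, 0.94])
```

In this example a single wrong enthalpy also makes the derived turbine work wrong, so only one of four weight units is earned. Reporting `sigma` alongside `avg` keeps reliability visible as a separate axis from accuracy.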
In practice
- Focus LLM training on real-fluid properties like R-134a.
- Implement self-verification for intermediate calculation steps.
- Prioritize models with low multi-run standard deviation for reliability.
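One simple form of self-verification for cycle problems is a first-law closure check: over a complete cycle, the net work should equal heat added minus heat rejected. The sketch below is one possible check, not a method described in the source; the function names, the 1% tolerance, and the numeric values are assumptions for illustration.

```python
def energy_balance_residual(q_in: float, q_out: float, w_net: float) -> float:
    """Relative first-law residual for a cycle: (q_in - q_out - w_net) / q_in."""
    return (q_in - q_out - w_net) / q_in


def passes_energy_check(
    q_in: float, q_out: float, w_net: float, tol: float = 0.01
) -> bool:
    """True if the cycle energy balance closes within tol (assumed 1%)."""
    return abs(energy_balance_residual(q_in, q_out, w_net)) <= tol


# Illustrative specific energies in kJ/kg, not taken from the benchmark.
consistent = passes_energy_check(q_in=2800.0, q_out=1900.0, w_net=900.0)
inconsistent = passes_energy_check(q_in=2800.0, q_out=1900.0, w_net=1000.0)
```

A failed check does not say which intermediate value is wrong, but it flags that an upstream error has propagated and that the answer should be recomputed before it is reported.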
Topics
- ThermoQA Benchmark
- Thermodynamic Reasoning
- Large Language Models
- Real-Fluid Properties
- Cycle Analysis
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.