Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

2026-05-08 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The ChemCost benchmark evaluates large language models (LLMs) on chemical procurement cost estimation, a practical task in which an agent reads a reaction description and must identify the chemicals, retrieve supplier quotes, select purchasable packs, normalize quantities, and compute costs. Introduced because scientific tool use in LLMs has seen little rigorous evaluation, ChemCost comprises 1,427 evaluable reactions grounded to a frozen pricing snapshot of 2,261 chemicals and 230,775 supplier quotes. It supports scalar scoring as well as stage-level diagnosis of failures in grounding, retrieval, procurement, and arithmetic. The benchmark also includes noise-injected views that test robustness to perturbed chemical aliases, quantity expressions, missing fields, and input formatting. In experiments with various LLM agents, even the strongest reaches only 50.6% accuracy on clean inputs, where an estimate counts as correct if it is within 25% relative error. Accuracy degrades significantly under noise, primarily because of brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
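
The headline metric, accuracy within 25% relative error, can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions the summary does not settle: the 25% boundary is treated as inclusive, and a run that returns no estimate is counted as a miss.

```python
def within_tolerance(predicted, reference, tol=0.25):
    """True when |predicted - reference| / |reference| <= tol (inclusive bound is an assumption)."""
    if reference == 0:
        raise ValueError("relative error is undefined for a zero reference cost")
    return abs(predicted - reference) / abs(reference) <= tol


def accuracy_within(predictions, references, tol=0.25):
    """Fraction of reactions whose predicted cost falls within tol of the reference.

    A prediction of None (the agent produced no estimate) is counted as a miss,
    which is an assumption of this sketch rather than a documented scoring rule.
    """
    pairs = list(zip(predictions, references))
    if not pairs:
        raise ValueError("no predictions to score")
    hits = sum(
        1 for pred, ref in pairs
        if pred is not None and within_tolerance(pred, ref, tol)
    )
    return hits / len(pairs)


if __name__ == "__main__":
    # Made-up costs for illustration only.
    print(accuracy_within([100.0, 130.0, None, 80.0], [100.0] * 4))
```

Under these assumptions an agent that never converges is penalized the same way as one that answers far off, which is why a stage-level breakdown is useful alongside the scalar score.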

Key takeaway

AI scientists developing LLM agents for scientific applications should recognize that current models, even with tool access, show significant limitations in multi-stage reasoning and robustness. Reliable performance on tasks like chemical procurement cost estimation calls for more resilient parsing, better evidence integration, and tool use that converges. Evaluation benchmarks should include diverse noise injections and stage-level diagnostics to pinpoint specific failure modes.

Key insights

LLM agents struggle with complex, multi-stage scientific reasoning tasks like chemical cost estimation, even with tool access.

Principles

Tool access alone is insufficient for complex scientific tasks.
Robustness requires handling diverse input noise.
Stage-level analysis reveals specific failure points.
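
As a rough illustration of what meaning-preserving input noise can look like, the Python sketch below rewrites gram quantities as milligrams and expands two common solvent abbreviations. It is not ChemCost's noise generator: the function names, the alias table, and the choice of perturbations are all assumptions made for this example.

```python
import re

# Hypothetical alias table; THF and DCM are standard abbreviations for these solvents.
ALIASES = {"THF": "tetrahydrofuran", "DCM": "dichloromethane"}

# Matches amounts such as "2.5 g" or "10g", but not "5 mg" or "10 kg".
_GRAM_AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*g\b")


def grams_to_milligrams(text):
    """Re-express every gram quantity in milligrams without changing its value."""
    def repl(match):
        milligrams = float(match.group(1)) * 1000
        return f"{milligrams:g} mg"
    return _GRAM_AMOUNT.sub(repl, text)


def swap_aliases(text):
    """Replace each abbreviation with its full chemical name."""
    for short, full in ALIASES.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text


if __name__ == "__main__":
    clean = "Add 2.5 g of amine in THF, then 5 mg of catalyst."
    print(swap_aliases(grams_to_milligrams(clean)))
```

Because both perturbations leave the underlying procurement problem unchanged, the reference cost stays the same, so any drop in accuracy on the perturbed text can be attributed to parsing or grounding rather than to a harder task.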

Method

ChemCost evaluates LLM agents on chemical procurement cost estimation using 1,427 reactions, 2,261 chemicals, and 230,775 supplier quotes, enabling scalar scoring and stage-level diagnostics for errors.
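
The procurement and arithmetic stages can be sketched in Python as follows. This is a simplified illustration, not the benchmark's actual procurement rule: the `Quote` type, the unit table, and the policy of buying whole packs of a single size at the lowest total price are all assumptions, and the quotes in the usage example are made up.

```python
import math
from dataclasses import dataclass

# Integer milligram factors keep the unit conversion exact for typical pack sizes.
UNIT_TO_MG = {"mg": 1, "g": 1_000, "kg": 1_000_000}


@dataclass(frozen=True)
class Quote:
    """One supplier offer: a pack of a given size at a given price (hypothetical schema)."""
    supplier: str
    pack_size: float
    pack_unit: str
    price: float


def to_mg(amount, unit):
    """Normalize a mass to milligrams."""
    return amount * UNIT_TO_MG[unit]


def cheapest_purchase(required_amount, required_unit, quotes):
    """Return (total_cost, n_packs, quote) for the cheapest single-pack-size purchase.

    Packs cannot be split, so the pack count is rounded up to cover the requirement.
    """
    required_mg = to_mg(required_amount, required_unit)
    best = None
    for quote in quotes:
        size_mg = to_mg(quote.pack_size, quote.pack_unit)
        n_packs = math.ceil(required_mg / size_mg)
        total = n_packs * quote.price
        if best is None or total < best[0]:
            best = (total, n_packs, quote)
    if best is None:
        raise ValueError("no quotes available for this chemical")
    return best


if __name__ == "__main__":
    quotes = [
        Quote("A", 1, "g", 12.0),
        Quote("B", 5, "g", 40.0),
        Quote("C", 500, "mg", 7.0),
    ]
    total, n_packs, quote = cheapest_purchase(2.5, "g", quotes)
    print(total, n_packs, quote.supplier)
```

The example shows why pack selection is a distinct failure point: the cheapest price per gram (supplier B) is not the cheapest purchase for 2.5 g, and an agent that multiplies a unit price by the required mass without rounding up to whole packs will misestimate the cost.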

In practice

Implement stage-level diagnostics for agent failures.
Test LLM agents with realistic noise injections.
Focus on improving parsing and evidence integration.
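
One way to implement stage-level diagnostics is to attribute each failed run to the earliest stage that went wrong and tally the results. The Python sketch below assumes each run has already been checked per stage and summarized as a dict of booleans; that record format and the function names are hypothetical, while the stage names follow the four stages listed in the summary.

```python
from collections import Counter

# Pipeline order: an error in an early stage usually invalidates everything downstream.
STAGES = ("grounding", "retrieval", "procurement", "arithmetic")


def first_failed_stage(stage_ok):
    """Return the earliest stage that failed, or None if every stage passed.

    A stage missing from the record is treated as failed, e.g. because the
    agent stopped before reaching it.
    """
    for stage in STAGES:
        if not stage_ok.get(stage, False):
            return stage
    return None


def failure_histogram(runs):
    """Count failed runs by the first stage at which each one broke."""
    failed = (first_failed_stage(run) for run in runs)
    return Counter(stage for stage in failed if stage is not None)


if __name__ == "__main__":
    runs = [
        {"grounding": True, "retrieval": True, "procurement": True, "arithmetic": True},
        {"grounding": True, "retrieval": False, "procurement": False, "arithmetic": False},
        {"grounding": False, "retrieval": True, "procurement": True, "arithmetic": False},
        {"grounding": True, "retrieval": True, "procurement": True},
    ]
    print(failure_histogram(runs))
```

Attributing to the first failed stage avoids double counting cascaded errors, at the cost of hiding independent downstream mistakes; a fuller analysis could record every failed stage per run.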

Topics

LLM Agents
Chemical Cost Reasoning
ChemCost Benchmark
Scientific Tool Use
Language Model Evaluation

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.