Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Summary
ChemCost is a benchmark that evaluates Large Language Model (LLM) agents on chemical procurement cost estimation. Given a reaction description, an agent must identify the chemicals, retrieve supplier quotes, select purchasable packs, normalize quantities, and compute the cost. The benchmark was introduced because scientific tool use in LLMs has seen little rigorous evaluation. It comprises 1,427 evaluable reactions grounded in a frozen pricing snapshot of 2,261 chemicals and 230,775 supplier quotes, and it supports both scalar scoring and stage-level diagnosis of failures in grounding, retrieval, procurement, and arithmetic. Noise-injected views test robustness to perturbed chemical aliases, quantity expressions, missing fields, and input formatting. In experiments with a range of LLM agents, even the strongest reaches only 50.6% accuracy (estimates within 25% relative error) on clean inputs, and performance degrades significantly under noise. Failures stem primarily from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
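To make the procurement steps concrete, here is a minimal sketch of normalizing a quantity and choosing a purchasable pack. The `Quote` type, the unit table, and the rule "buy whole packs from a single quote, pick the cheapest total" are all assumptions made for illustration; the summary does not specify the benchmark's actual selection rule or data schema.

```python
import math
from dataclasses import dataclass

# Hypothetical unit table (mass only); not taken from the benchmark.
TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


@dataclass(frozen=True)
class Quote:
    """Hypothetical supplier quote: one pack of a given size at a given price."""
    supplier: str
    pack_size: float
    pack_unit: str
    price: float  # price of one pack


def to_grams(amount, unit):
    """Normalize a mass quantity to grams."""
    return amount * TO_GRAMS[unit]


def cheapest_purchase(required_g, quotes):
    """Return (cost, quote, n_packs) for the cheapest single-quote purchase.

    Assumed rule: only whole packs can be bought, so the pack count is
    rounded up. Returns None when no quotes are available.
    """
    best = None
    for q in quotes:
        pack_g = to_grams(q.pack_size, q.pack_unit)
        n_packs = math.ceil(required_g / pack_g)
        cost = n_packs * q.price
        if best is None or cost < best[0]:
            best = (cost, q, n_packs)
    return best


# Usage: a reaction needs 2500 mg of a reagent.
quotes = [
    Quote("A", 1, "g", 12.0),
    Quote("B", 5, "g", 30.0),
    Quote("C", 500, "mg", 7.0),
]
cost, chosen, n_packs = cheapest_purchase(to_grams(2500, "mg"), quotes)
```

Even this toy version shows why pack selection is a distinct failure point: the lowest unit price does not always give the lowest purchase cost once whole-pack rounding is applied.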
Key takeaway
AI scientists building LLM agents for scientific applications should recognize that current models, even with tool access, show significant limitations in multi-stage reasoning and robustness. Reliable performance on tasks like chemical procurement cost estimation calls for more resilient parsing, better evidence integration, and tool use that converges. Evaluation benchmarks should include diverse noise injections and stage-level diagnostics to pinpoint specific failure modes.
Key insights
LLM agents struggle with complex, multi-stage scientific reasoning tasks like chemical cost estimation, even with tool access.
Principles
- Tool access alone is insufficient for complex scientific tasks.
- Robustness requires handling diverse input noise.
- Stage-level analysis reveals specific failure points.
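As an illustration of the stage-level principle, the sketch below attributes each failed run to the first stage that went wrong and tallies the results. The stage names come from the failure categories listed in the summary; their ordering, the `stage_ok` record format, and the first-failure attribution rule are assumptions, not the benchmark's documented procedure.

```python
from collections import Counter

# Stage names follow the summary; the order is an assumed pipeline order.
STAGES = ("grounding", "retrieval", "procurement", "arithmetic")


def first_failed_stage(stage_ok):
    """Return the first stage that did not pass, or None if all passed.

    stage_ok is a hypothetical per-run record mapping stage name -> bool.
    A stage missing from the record is treated as failed.
    """
    for stage in STAGES:
        if not stage_ok.get(stage, False):
            return stage
    return None


def failure_histogram(runs):
    """Count failed runs by the first stage that failed."""
    failed = (first_failed_stage(run) for run in runs)
    return Counter(stage for stage in failed if stage is not None)


runs = [
    {"grounding": True, "retrieval": True, "procurement": True, "arithmetic": True},
    {"grounding": True, "retrieval": False},
    {"grounding": False},
    {"grounding": True, "retrieval": True, "procurement": True, "arithmetic": False},
]
histogram = failure_histogram(runs)
```

A histogram like this separates, for example, agents that never identify the right chemical from agents that find the right quotes but miscompute the total.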
Method
ChemCost evaluates LLM agents on chemical procurement cost estimation across 1,427 reactions, priced against a frozen snapshot of 2,261 chemicals and 230,775 supplier quotes. Each run can be given a scalar score and a stage-level diagnosis of where errors occur.
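The scalar score quoted in the summary (accuracy within 25% relative error) can be sketched as follows. The function names, the inclusive boundary at the tolerance, and the choice to count a missing prediction as a miss are assumptions; reference costs are assumed to be positive.

```python
def relative_error(predicted, reference):
    """Absolute error as a fraction of the reference cost (reference != 0)."""
    return abs(predicted - reference) / abs(reference)


def accuracy_within(predictions, references, tolerance=0.25):
    """Fraction of cases whose prediction lies within the relative tolerance.

    A prediction of None (for example, an agent that never produced an
    answer) is counted as a miss. This handling is an assumption.
    """
    hits = 0
    for pred, ref in zip(predictions, references):
        if pred is not None and relative_error(pred, ref) <= tolerance:
            hits += 1
    return hits / len(references)


# Usage: four reactions with a reference cost of 100 each.
predictions = [100.0, 130.0, None, 80.0]
references = [100.0, 100.0, 100.0, 100.0]
score = accuracy_within(predictions, references)
```

Here the first and last predictions fall within tolerance, the second is 30% off, and the third is missing, so the score is 0.5.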
In practice
- Implement stage-level diagnostics for agent failures.
- Test LLM agents with realistic noise injections.
- Focus on improving parsing and evidence integration.
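One way to act on the noise-injection advice is to derive perturbed views from clean inputs, as in the sketch below. The record fields, the small alias table, and the specific perturbations are hypothetical stand-ins for the noise types the summary names (chemical aliases, quantity expressions, missing fields); they are not the benchmark's actual generators.

```python
import random

# Small illustrative alias table; a real one would be far larger.
ALIASES = {
    "ethanol": ["EtOH", "ethyl alcohol"],
    "dichloromethane": ["DCM", "methylene chloride"],
}


def perturb_alias(name, rng):
    """Swap a chemical name for a known alias, if one is listed."""
    options = ALIASES.get(name.lower())
    return rng.choice(options) if options else name


def perturb_quantity(amount, unit):
    """Re-express grams as milligrams; leave other units unchanged."""
    if unit == "g":
        return amount * 1000, "mg"
    return amount, unit


def drop_field(record, field):
    """Return a copy of the record without the given field."""
    return {k: v for k, v in record.items() if k != field}


def noisy_view(record, seed=0):
    """Build a perturbed copy of a clean record; the input is not modified."""
    rng = random.Random(seed)  # seeded so the view is reproducible
    out = dict(record)
    out["name"] = perturb_alias(record["name"], rng)
    out["amount"], out["unit"] = perturb_quantity(record["amount"], record["unit"])
    return out


clean = {"name": "Ethanol", "amount": 2.5, "unit": "g"}
noisy = noisy_view(clean, seed=1)
partial = drop_field(clean, "unit")
```

Because each perturbation preserves the underlying quantity, the reference cost stays the same, and any drop in score can be attributed to the agent's handling of the input rather than to a change in the task.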
Topics
- LLM Agents
- Chemical Cost Reasoning
- ChemCost Benchmark
- Scientific Tool Use
- Language Model Evaluation
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.