Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Summary
ChemCost is a new benchmark that evaluates Large Language Models (LLMs) as tool-using agents for estimating chemical procurement costs. It comprises 1,427 evaluable reactions grounded in a frozen pricing snapshot of 2,261 chemicals and 230,775 supplier quotes. Starting from a reaction description, an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute the cost. ChemCost also includes controlled noise-injected views that test robustness to chemical aliases, varied quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but not sufficient to solve the task. The strongest agents reached only 50.6% accuracy at a 25% relative-error tolerance on clean inputs, and performance degraded significantly under realistic noise. The main failure modes were brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
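To make the headline metric concrete, here is a minimal sketch of accuracy at a relative-error tolerance. The function name, the handling of missing answers (counted as misses), and the sample numbers are illustrative assumptions, not details taken from the ChemCost paper.

```python
def accuracy_within_tolerance(predicted, reference, tolerance=0.25):
    """Fraction of predictions whose relative error is at most `tolerance`.

    Assumption: a missing prediction (None) counts as a miss.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")

    hits = 0
    for pred, ref in zip(predicted, reference):
        if pred is None:
            continue  # no answer produced: treated as incorrect
        if ref == 0:
            raise ValueError("reference cost must be non-zero")
        relative_error = abs(pred - ref) / abs(ref)
        if relative_error <= tolerance:
            hits += 1
    return hits / len(reference)


# Hypothetical cost estimates against hypothetical reference costs.
score = accuracy_within_tolerance([100, 130, 50, None], [100, 100, 100, 100])
```

Under this definition, an estimate of 120 against a reference of 100 counts as correct, while 130 does not.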
Key takeaway
Machine Learning Engineers building scientific agents should recognize that current LLMs, even with tool access, are highly susceptible to input noise and struggle with multi-step quantitative reasoning in chemistry. Prioritize robust parsing of varied chemical text formats, better integration of retrieved evidence, and accurate, constrained pack selection. Focus on making tool-use trajectories more reliable and on reducing non-convergent tool calls, especially for multi-step chemical synthesis routes.
Key insights
LLM agents struggle with real-world chemical procurement cost estimation, even with tool access, due to parsing and reasoning failures.
Principles
- Tool access is necessary but insufficient for complex scientific reasoning.
- Input format noise significantly degrades agent performance.
- Route depth is a dominant factor in task difficulty.
Method
ChemCost evaluates LLM agents on chemical procurement cost estimation by requiring them to resolve chemical names, retrieve supplier quotes, select valid packs, normalize quantities, and aggregate costs for 1,427 reactions.
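The pack-selection and aggregation steps described above can be sketched as follows. This is a simplified illustration, not the benchmark's actual costing rule: it assumes each chemical is bought in multiples of a single pack size, that quantities are already in grams, and that the `Pack` type, function names, and prices are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pack:
    size_g: float      # pack size in grams (assumed already normalized)
    price_usd: float   # price of one pack


def cheapest_single_pack_cost(required_g, packs):
    """Cheapest way to cover `required_g` using multiples of one pack size."""
    if required_g <= 0:
        raise ValueError("required quantity must be positive")
    # A pack is only usable if it has a positive size and a non-negative price.
    valid = [p for p in packs if p.size_g > 0 and p.price_usd >= 0]
    if not valid:
        raise ValueError("no valid purchasable pack")
    return min(math.ceil(required_g / p.size_g) * p.price_usd for p in valid)


def reaction_cost(requirements_g, catalog):
    """Sum the per-chemical procurement cost over all chemicals in a reaction."""
    return sum(
        cheapest_single_pack_cost(quantity, catalog[name])
        for name, quantity in requirements_g.items()
    )


# Hypothetical catalog and requirements.
catalog = {
    "reagent_a": [Pack(5, 20), Pack(25, 60), Pack(100, 150)],
    "reagent_b": [Pack(25, 60)],
}
total = reaction_cost({"reagent_a": 10, "reagent_b": 30}, catalog)
```

Even in this reduced form, the answer depends on rounding up to whole packs, which is one place where an agent that multiplies a unit price by the required quantity would go wrong.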
In practice
- Focus on robust parsing for noisy chemical text inputs.
- Improve evidence integration in multi-step tool-use workflows.
- Develop agents that can handle varied quantity expressions.
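As one illustration of handling varied quantity expressions, the sketch below normalizes a few mass and amount-of-substance strings to grams. The supported units, the regular expression, and the `to_grams` name are assumptions made for this example; real reaction text also includes volumes, equivalents, and free-form phrasing that this does not cover.

```python
import re

# Conversion factors to the base unit (grams or moles).
_MASS_TO_G = {"mg": 1e-3, "g": 1.0, "kg": 1e3}
_AMOUNT_TO_MOL = {"mmol": 1e-3, "mol": 1.0}

# A number followed by a unit, with optional whitespace, e.g. "250 mg" or "1.5kg".
_QUANTITY = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]+)\s*$")


def to_grams(text, molar_mass_g_per_mol=None):
    """Convert a quantity string to grams; molar amounts need a molar mass."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"unrecognized quantity: {text!r}")
    value = float(match.group(1))
    unit = match.group(2).lower()

    if unit in _MASS_TO_G:
        return value * _MASS_TO_G[unit]
    if unit in _AMOUNT_TO_MOL:
        if molar_mass_g_per_mol is None:
            raise ValueError("molar mass is required to convert moles to grams")
        return value * _AMOUNT_TO_MOL[unit] * molar_mass_g_per_mol
    raise ValueError(f"unsupported unit: {unit!r}")


grams = to_grams("2 mmol", molar_mass_g_per_mol=58.44)  # roughly 0.117 g
```

Raising an error on unrecognized input, rather than guessing, keeps a parsing failure visible instead of letting it silently distort the final cost.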
Topics
- LLM Agents
- Chemical Procurement
- ChemCost Benchmark
- Scientific Tool Use
- Cost Estimation
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.