COMPOSITE-STEM
Summary
COMPOSITE-STEM is a new benchmark of 70 expert-curated tasks across physics, biology, chemistry, and mathematics, designed to evaluate AI agents in scientific discovery workflows. Developed by doctoral-level researchers, it addresses the limitations of saturated benchmarks by combining exact-match grading with LLM-as-a-jury criterion-based rubrics, which allows more flexible assessment of scientifically meaningful outputs. The benchmark uses an adapted multimodal Terminus-2 agent harness within the Harbor evaluation framework. Initial evaluations of four frontier models show that the top-performing model, claude-opus-4.6, achieved only 21.4% (Pass@1), indicating that COMPOSITE-STEM captures capabilities beyond the reach of current agents. All tasks are open-sourced to promote reproducibility and further research on how AI can accelerate scientific progress.
Key takeaway
For AI scientists and machine learning engineers developing agents for scientific applications, COMPOSITE-STEM provides a challenging, open-source benchmark to rigorously test and improve agent capabilities. Your teams should focus on enhancing agent persistence, tool integration, and multimodal understanding: the best of the four frontier models evaluated achieves only 21.4% (Pass@1) on these expert-curated tasks, which leaves significant room for improvement in real-world scientific problem-solving.
Key insights
COMPOSITE-STEM offers a rigorous, expert-curated benchmark for evaluating AI agents in complex scientific domains.
Principles
- Expert-authored tasks enhance benchmark rigor.
- Flexible grading beyond exact-match improves assessment.
- Reproducible environments are crucial for agent evaluation.
Method
The AsymmetryZero grading protocol uses criterion-centric contracts with both exact-match and multi-judge LLM scoring, aggregating votes for semantic correctness. Tasks are executed in a sandboxed Harbor environment with a multimodal agent harness.
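As a rough illustration of how exact-match checks and multi-judge voting might sit side by side, here is a minimal Python sketch. It is not the AsymmetryZero implementation: the `Criterion` class, the strict-majority rule, the "all criteria must pass" rule, and the stubbed judges are all assumptions made for the example, and a real jury would call LLMs rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical judge signature: (criterion description, agent answer) -> vote.
# In a real setup each judge would wrap an LLM call; here they are stubs.
Judge = Callable[[str, str], bool]


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, assumed for this sketch)."""
    description: str
    expected: Optional[str] = None  # set -> exact-match; None -> jury vote


def exact_match(expected: str, answer: str) -> bool:
    # Assumed normalization: ignore surrounding whitespace and letter case.
    return expected.strip().lower() == answer.strip().lower()


def jury_vote(criterion: Criterion, answer: str, judges: List[Judge]) -> bool:
    # Collect one vote per judge and require a strict majority (assumption).
    votes = [judge(criterion.description, answer) for judge in judges]
    return sum(votes) * 2 > len(votes)


def grade(criteria: List[Criterion], answer: str, judges: List[Judge]) -> bool:
    # Assumed aggregation: the answer passes only if every criterion passes.
    results = []
    for criterion in criteria:
        if criterion.expected is not None:
            results.append(exact_match(criterion.expected, answer))
        else:
            results.append(jury_vote(criterion, answer, judges))
    return all(results)


if __name__ == "__main__":
    # Stub judges standing in for LLM calls.
    lenient = lambda description, answer: True
    strict = lambda description, answer: "because" in answer
    rubric = [Criterion("Justifies the result with a stated reason")]
    print(grade(rubric, "The yield drops because the catalyst degrades.",
                [lenient, strict, strict]))
```

Using an odd number of judges avoids ties under the strict-majority rule; with an even panel, a split vote counts as a failure in this sketch.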
In practice
- Use RDKit for robust cheminformatics tasks.
- Prioritize agent persistence and tool use for complex problems.
Topics
- COMPOSITE-STEM Benchmark
- AI Agent Evaluation
- Scientific Discovery
- LLM-as-a-Jury Grading
- Harbor Framework
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.