COMPOSITE-STEM

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

COMPOSITE-STEM is a new benchmark of 70 expert-curated tasks across physics, biology, chemistry, and mathematics, designed to evaluate AI agents in scientific discovery workflows. Developed by doctoral-level researchers, it addresses the limitations of saturated benchmarks by combining exact-match grading with LLM-as-a-jury, criterion-based rubrics, which allow more flexible assessment of scientifically meaningful outputs. The benchmark uses an adapted multimodal Terminus-2 agent harness within the Harbor evaluation framework. Initial evaluations of four frontier models show that the top-performing model, claude-opus-4.6, achieved a Pass@1 of only 21.4%, indicating that COMPOSITE-STEM captures capabilities beyond the reach of current agents. All tasks are open-sourced to promote reproducibility and further research on using AI to accelerate scientific progress.
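
To make the headline number concrete, here is a minimal sketch of a Pass@1 calculation. It assumes each task is scored pass or fail on a single attempt, and the count of 15 solved tasks is an illustration chosen to match the reported 21.4% over 70 tasks, not a figure stated in the source.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks solved on the first (and only) attempt."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcome list: 15 passes out of 70 tasks (assumed, see lead-in).
outcomes = [True] * 15 + [False] * 55
score = pass_at_1(outcomes)
print(f"Pass@1 = {score:.1%}")  # Pass@1 = 21.4%
```

Under that binary-scoring assumption, 15 is the only whole number of solved tasks out of 70 that rounds to 21.4%.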

Key takeaway

For AI scientists and machine learning engineers developing agents for scientific applications, COMPOSITE-STEM provides a challenging, open-source benchmark to rigorously test and improve agent capabilities. Your teams should focus on enhancing agent persistence, tool integration, and multimodal understanding: even the best of the four frontier models evaluated reaches only 21.4% Pass@1 on these expert-curated tasks, which leaves significant room for improvement in real-world scientific problem-solving.

Key insights

COMPOSITE-STEM offers a rigorous, expert-curated benchmark for evaluating AI agents in complex scientific domains.

Principles

Expert-authored tasks enhance benchmark rigor.
Flexible grading beyond exact-match improves assessment.
Reproducible environments are crucial for agent evaluation.

Method

The AsymmetryZero grading protocol uses criterion-centric contracts with both exact-match and multi-judge LLM scoring, aggregating votes for semantic correctness. Tasks are executed in a sandboxed Harbor environment with a multimodal agent harness.
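
The grading flow described above can be sketched roughly as follows. This is not the AsymmetryZero implementation: the `Criterion` fields, the strict-majority vote rule, the all-criteria-must-pass task rule, and the stub judges (stand-ins for LLM calls) are all assumptions made for illustration, since the source says only that votes are aggregated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    kind: str           # "exact" or "jury" (hypothetical labels)
    expected: str = ""  # reference answer, or the concept the rubric checks


# A judge maps (answer, criterion) to a pass/fail vote. A real judge would
# wrap an LLM call; here it is any plain callable.
Judge = Callable[[str, Criterion], bool]


def exact_match(answer: str, expected: str) -> bool:
    """Compare after trimming whitespace and ignoring case."""
    return answer.strip().lower() == expected.strip().lower()


def majority(votes: Sequence[bool]) -> bool:
    """Strict majority of judge votes; a tie counts as a fail (assumed rule)."""
    return 2 * sum(votes) > len(votes)


def grade(answer: str, criteria: List[Criterion], judges: List[Judge]) -> Dict[str, bool]:
    """Score each criterion by exact match or by aggregated jury vote."""
    results: Dict[str, bool] = {}
    for c in criteria:
        if c.kind == "exact":
            results[c.name] = exact_match(answer, c.expected)
        else:
            results[c.name] = majority([judge(answer, c) for judge in judges])
    return results


def task_passed(results: Dict[str, bool]) -> bool:
    """Assumed rule: a task passes only if every criterion passes."""
    return all(results.values())


# Stub judges standing in for LLM calls.
def keyword_judge(answer: str, criterion: Criterion) -> bool:
    return criterion.expected.lower() in answer.lower()


def strict_judge(answer: str, criterion: Criterion) -> bool:
    return len(answer.split()) >= 5 and keyword_judge(answer, criterion)


def skeptic_judge(answer: str, criterion: Criterion) -> bool:
    return False


answer = "Energy is conserved because the net work is zero"
rubric = [Criterion("mentions_conservation", "jury", "conserved")]
results = grade(answer, rubric, [keyword_judge, strict_judge, skeptic_judge])
print(results, task_passed(results))
```

Two of the three stub judges vote to pass, so the criterion passes by majority. Using an odd number of judges avoids ties under this rule.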

In practice

Use RDKit for robust cheminformatics tasks.
Prioritize agent persistence and tool use for complex problems.

Topics

COMPOSITE-STEM Benchmark
AI Agent Evaluation
Scientific Discovery
LLM-as-a-Jury Grading
Harbor Framework

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.