17 AI Models Tested on REAL Scientific Research

· Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation · Depth: Intermediate, extended

Summary

The Shanghai Artificial Intelligence Lab introduced a new benchmark for end-to-end autonomous scientific research, published on June 8th (declared May 28th). The benchmark evaluates 17 AI models across 10 scientific domains using 40 human-defined tasks, and focuses on whether the AI can independently rediscover scientific patterns from the provided datasets and literature. Unlike benchmarks that rely on AI judges, it uses human domain experts to assess the complete research trajectory, including experimental code, figures, and reproducible workflows. Results indicate that even leading systems such as Claude Code and GPT-5.5 achieve low scientific quality scores, generally around 20%, which highlights how limited current AI remains at complex scientific discovery. Notably, Qwen-3.7-Max paired with a minimal "research harness" often performed better, especially in physics, and offered a compelling cost-performance trade-off compared with more expensive, fully autonomous agents.

Key takeaway

Research scientists and ML engineers selecting AI models for scientific discovery should recognize that expensive, fully autonomous agents like Claude Code do not consistently outperform simpler LLM-plus-harness configurations. Evaluate models like Qwen-3.7-Max with a minimal research harness, especially in domains like physics, where they can offer comparable or superior performance at significantly lower cost. Prioritize domain-specific performance and cost-efficiency over general agent complexity when optimizing scientific AI workflows.
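One way to act on this takeaway is to compare candidate configurations by domain score and cost together, rather than by score alone. The sketch below is a minimal illustration, not the benchmark's own method: the `Config` type, the `pareto_front` helper, and the idea of a per-task dollar cost are all assumptions made for the example. It keeps only the configurations that no other configuration beats on both score and cost.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One model-plus-tooling setup (hypothetical structure for illustration)."""
    name: str
    score: float      # domain-specific quality score, in percent
    cost_usd: float   # assumed average cost per task, in US dollars


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return configs that are not dominated on (higher score, lower cost).

    A config is dominated if another one is at least as good on both
    axes and strictly better on at least one of them.
    """
    front = []
    for c in configs:
        dominated = any(
            o.score >= c.score
            and o.cost_usd <= c.cost_usd
            and (o.score > c.score or o.cost_usd < c.cost_usd)
            for o in configs
        )
        if not dominated:
            front.append(c)
    # Cheapest first, so the trade-off reads left to right.
    return sorted(front, key=lambda c: c.cost_usd)
```

To use it, fill in `Config` entries with your own measured scores and costs for the domain you care about. A fully autonomous agent that scores no higher than a cheaper LLM-plus-harness setup drops off the front, which is the situation the takeaway describes.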

Key insights

Advanced AI models currently demonstrate limited capability in independent scientific discovery, even with curated data and extensive tools.

Principles

Method

The benchmark provides AI agents with task instructions, curated datasets, literature, and a workspace. It requires generating experimental code, figures, reasoning traces, and reproducible workflows, all evaluated by human domain experts.
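The inputs and required outputs described above can be sketched as two small data structures. This is a hedged illustration only: the class names, field names, and the `quality_score` aggregation (a plain average of expert ratings) are assumptions made for the example, since the summary does not specify the benchmark's actual schema or scoring formula.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class ResearchTask:
    """What the agent receives (field names are illustrative)."""
    domain: str
    instructions: str
    dataset_paths: List[str]
    literature_paths: List[str]
    workspace_dir: str


@dataclass
class Submission:
    """What the agent must hand back for human expert review."""
    code_files: List[str] = field(default_factory=list)
    figures: List[str] = field(default_factory=list)
    reasoning_trace: str = ""
    workflow_script: str = ""  # entry point that reproduces the results

    def missing_artifacts(self) -> List[str]:
        """List the required deliverables that are still empty."""
        required = [
            ("experimental code", self.code_files),
            ("figures", self.figures),
            ("reasoning trace", self.reasoning_trace),
            ("reproducible workflow", self.workflow_script),
        ]
        return [label for label, value in required if not value]


def quality_score(expert_ratings: List[float]) -> float:
    """Average expert ratings in [0, 1] into a percentage (assumed scheme)."""
    if not expert_ratings:
        raise ValueError("at least one expert rating is required")
    if any(r < 0.0 or r > 1.0 for r in expert_ratings):
        raise ValueError("ratings must lie in [0, 1]")
    return 100.0 * mean(expert_ratings)
```

A completeness check like `missing_artifacts()` only confirms that every deliverable exists. Judging whether the science is sound is the part the benchmark leaves to human domain experts.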

In practice

Topics

Best for: AI Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.