17 AI Models Tested on REAL Scientific Research

2026-06-10 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation · Depth: Intermediate, extended

Summary

The Shanghai Artificial Intelligence Lab introduced a new benchmark for end-to-end autonomous scientific research, published on June 8th (declared May 28th). The benchmark evaluates 17 AI models across 10 scientific domains using 40 human-defined tasks, focusing on each model's ability to independently rediscover scientific patterns from provided datasets and literature. Unlike benchmarks that rely on AI judges, it uses human domain experts, who assess the complete research trajectory, including experimental code, figures, and reproducible workflows. Results indicate that even leading systems such as Claude Code and GPT-5.5 achieve low scientific quality scores, generally around 20%, which highlights current AI limitations in complex scientific discovery. Notably, Qwen-3.7-Max, when paired with a minimal "research harness," often outperformed more expensive, fully autonomous agents, especially in physics, and offered a compelling cost-performance trade-off.

Key takeaway

For research scientists and ML engineers selecting AI models for scientific discovery, recognize that expensive, fully autonomous agents like Claude Code do not consistently outperform simpler LLM-plus-harness configurations. You should evaluate models like Qwen-3.7-Max with a minimal research harness, especially for domains like physics, as they offer comparable or superior performance at significantly lower costs. Prioritize domain-specific performance and cost-efficiency over general agent complexity to optimize your scientific AI workflows.

Key insights

Advanced AI models currently demonstrate limited capability in independent scientific discovery, even with curated data and extensive tools.

Principles

Human domain experts are essential for robust AI scientific evaluation.
Fully autonomous agents do not guarantee superior scientific performance.
Cost-performance trade-offs are critical for scientific AI model deployment.
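
One simple way to make the cost-performance principle concrete is to rank configurations by expert score per dollar. The sketch below assumes a 0 to 100 quality score and a known total run cost; the `RunResult` type, the configuration labels, and every number are hypothetical placeholders, not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    name: str             # hypothetical configuration label
    quality_score: float  # expert-assigned score, assumed to be on a 0-100 scale
    cost_usd: float       # total spend for the run


def score_per_dollar(run: RunResult) -> float:
    """Quality points obtained per dollar spent."""
    if run.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return run.quality_score / run.cost_usd


def rank_by_value(runs: list[RunResult]) -> list[RunResult]:
    """Sort runs from best to worst score per dollar."""
    return sorted(runs, key=score_per_dollar, reverse=True)


# Placeholder numbers for illustration only; NOT taken from the benchmark.
runs = [
    RunResult("full_agent", quality_score=22.0, cost_usd=40.0),
    RunResult("llm_plus_harness", quality_score=20.0, cost_usd=5.0),
]
best = rank_by_value(runs)[0]
print(best.name, score_per_dollar(best))
```

A single ratio hides a lot: a cheap run that scores near zero is still poor value in practice, so a minimum acceptable score is worth applying before ranking.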

Method

The benchmark provides AI agents with task instructions, curated datasets, literature, and a workspace. It requires generating experimental code, figures, reasoning traces, and reproducible workflows, all evaluated by human domain experts.
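
The inputs and required outputs described above can be pictured as a small data structure plus a completeness check. This is only an illustrative sketch: the source does not give the benchmark's actual schema, so the class names, field names, and artifact keys below are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical artifact keys mirroring the four outputs named in the summary.
REQUIRED_ARTIFACTS = (
    "experimental_code",
    "figures",
    "reasoning_trace",
    "reproducible_workflow",
)


@dataclass
class ResearchTask:
    """What the agent receives (field names are assumptions)."""
    instructions: str
    datasets: list[str]
    literature: list[str]
    workspace: str


@dataclass
class Submission:
    """What the agent hands back: artifact kind -> list of file paths."""
    artifacts: dict[str, list[str]] = field(default_factory=dict)


def missing_artifacts(submission: Submission) -> list[str]:
    """Return the required artifact kinds that are absent or empty."""
    return [kind for kind in REQUIRED_ARTIFACTS if not submission.artifacts.get(kind)]


task = ResearchTask(
    instructions="Rediscover the pattern in the provided dataset.",
    datasets=["data.csv"],
    literature=["paper.pdf"],
    workspace="./workspace",
)
partial = Submission(artifacts={"experimental_code": ["run.py"], "figures": []})
print(missing_artifacts(partial))
```

A check like this only confirms that a submission is complete enough to review. The scientific quality judgment itself stays with the human domain experts.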

In practice

Deploy lightweight research harnesses with LLMs for scientific tasks.
Investigate Qwen-3.7-Max for physics-related scientific discovery.
Prioritize cost-performance over complex autonomous agents for specific domains.
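
The source does not describe what the "research harness" contains. One plausible minimal reading is a loop that sends the task to a model, runs the code it returns, and feeds the output back. The sketch below assumes exactly that; `research_loop`, `run_python`, the `DONE` stop signal, and the `call_model` callable (a stand-in for whatever LLM client you use) are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_python(code: str, workdir: Path, timeout_s: int = 60) -> str:
    """Run model-written code in a subprocess and return stdout plus stderr.

    No sandboxing is applied here; a real deployment should isolate this step.
    A run that exceeds timeout_s raises subprocess.TimeoutExpired.
    """
    script = workdir / "step.py"
    script.write_text(code)
    proc = subprocess.run(
        [sys.executable, str(script)],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.stdout + proc.stderr


def research_loop(
    task: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, Python code (or "DONE") out
    workdir: Path,
    max_steps: int = 3,
) -> list[str]:
    """Alternate between asking the model for code and executing it."""
    outputs: list[str] = []
    prompt = task
    for _ in range(max_steps):
        code = call_model(prompt)
        if code.strip() == "DONE":  # assumed stop signal, not a real protocol
            break
        output = run_python(code, workdir)
        outputs.append(output)
        prompt = f"{task}\n\nPrevious output:\n{output}"
    return outputs


# Usage with a fake model, so the sketch runs without any API access.
def fake_model(prompt: str) -> str:
    return "DONE" if "Previous output" in prompt else "print(1 + 1)"


with tempfile.TemporaryDirectory() as tmp:
    results = research_loop("Add two numbers.", fake_model, Path(tmp))
print(results)
```

Passing the model in as a plain callable keeps the harness independent of any vendor SDK, which makes it easy to swap models when comparing cost and performance.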

Topics

Autonomous Scientific Research
AI Model Benchmarking
Research Harness
Human-in-the-Loop Evaluation
Qwen-3.7-Max
Cost-Performance Analysis

Best for: AI Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.