17 AI Models Tested on REAL Scientific Research
Summary
The Shanghai Artificial Intelligence Lab introduced a new benchmark for end-to-end autonomous scientific research, published on June 8th (declared May 28th). The benchmark evaluates 17 AI models across 10 scientific domains using 40 human-defined tasks, focusing on each system's ability to independently rediscover scientific patterns from provided datasets and literature. Unlike benchmarks that rely on AI judges, it uses human domain experts to assess the complete research trajectory, including experimental code, figures, and reproducible workflows. Results indicate that even leading systems such as Claude Code and GPT-5.5 achieve low scientific quality scores, generally around 20%, highlighting current AI limitations in complex scientific discovery. Notably, Qwen-3.7-Max, when paired with a minimal "research harness," often demonstrated superior performance, especially in physics, and presented a compelling cost-performance trade-off compared to more expensive, fully autonomous agents.
Key takeaway
For research scientists and ML engineers selecting AI models for scientific discovery: expensive, fully autonomous agents like Claude Code do not consistently outperform simpler LLM-plus-harness configurations. Evaluate models like Qwen-3.7-Max with a minimal research harness, especially in domains like physics, where such configurations can offer comparable or superior performance at significantly lower cost. Prioritize domain-specific performance and cost-efficiency over general agent complexity when optimizing your scientific AI workflows.
Key insights
Advanced AI models currently demonstrate limited capability in independent scientific discovery, even with curated data and extensive tools.
Principles
- Human domain experts are essential for robust AI scientific evaluation.
- Fully autonomous agents do not guarantee superior scientific performance.
- Cost-performance trade-offs are critical for scientific AI model deployment.
Method
The benchmark provides AI agents with task instructions, curated datasets, literature, and a workspace. It requires generating experimental code, figures, reasoning traces, and reproducible workflows, all evaluated by human domain experts.
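The inputs and required outputs described above can be pictured as a pair of simple records. The following is only a minimal sketch: every class, field, and function name is hypothetical and chosen for illustration, and nothing here reflects the benchmark's actual file layout or API. The completeness check is a mechanical pre-filter one might run before handing a submission to human experts; it does not stand in for their evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchTask:
    """What an agent is given (hypothetical field names)."""
    instructions: str
    dataset_paths: list[str]
    literature_paths: list[str]
    workspace_dir: str


@dataclass
class Submission:
    """What an agent must hand back for expert review (hypothetical field names)."""
    experimental_code: list[str] = field(default_factory=list)
    figures: list[str] = field(default_factory=list)
    reasoning_trace: str = ""
    workflow_steps: list[str] = field(default_factory=list)


# The four artifact types named in the method description.
REQUIRED_ARTIFACTS = (
    "experimental_code",
    "figures",
    "reasoning_trace",
    "workflow_steps",
)


def missing_artifacts(submission: Submission) -> list[str]:
    """Return the names of required artifacts that are empty or absent."""
    return [name for name in REQUIRED_ARTIFACTS if not getattr(submission, name)]
```

A submission with no missing artifacts is merely complete, not correct; scientific quality is still judged by the domain experts.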
In practice
- Deploy lightweight research harnesses with LLMs for scientific tasks.
- Investigate Qwen-3.7-Max for physics-related scientific discovery.
- Prioritize cost-performance over complex autonomous agents for specific domains.
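The recommendations above amount to choosing, for each domain, the best-scoring configuration that fits a budget. A minimal sketch of that selection follows. The `RunResult` record, the `best_per_domain` helper, and the configuration names in the usage note are all hypothetical, and any numbers passed in should come from your own evaluation runs rather than from this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One evaluated configuration in one domain (hypothetical record)."""
    name: str        # e.g. an agent or an LLM-plus-harness setup
    domain: str      # e.g. "physics"
    quality: float   # expert-assigned score in [0, 1]
    cost_usd: float  # total cost of the run


def best_per_domain(results, budget_usd):
    """Pick the highest-quality result per domain within budget.

    Ties on quality are broken in favor of the cheaper run.
    """
    best = {}
    for result in results:
        if result.cost_usd > budget_usd:
            continue  # over budget, not a candidate
        current = best.get(result.domain)
        candidate_key = (result.quality, -result.cost_usd)
        if current is None or candidate_key > (current.quality, -current.cost_usd):
            best[result.domain] = result
    return best
```

For example, calling `best_per_domain(results, budget_usd=10.0)` on a list that contains both a costly agent run and a cheap harness run for the same domain returns only the configurations that fit the budget, which makes the per-domain trade-off explicit instead of relying on a single overall ranking.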
Topics
- Autonomous Scientific Research
- AI Model Benchmarking
- Research Harness
- Human-in-the-Loop Evaluation
- Qwen-3.7-Max
- Cost-Performance Analysis
Best for: AI Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.