DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

DSAEval is a new benchmark designed to evaluate Large Language Model (LLM)-based data science agents on 641 real-world problems across 285 diverse datasets, including structured and unstructured data. It features Multimodal Environment Perception, Multi-Query Interactions, and Multi-Dimensional Evaluation, assessing reasoning, code, and results. A systematic evaluation of 11 advanced agentic LLMs revealed Claude-Sonnet-4.5 achieved the strongest overall performance (8.164 score), while GPT-5.2 was the most efficient (approx. 20,000 tokens) and MiMo-V2-Flash the most cost-effective (approx. 0.007 USD per task). Multimodal perception consistently improved performance on vision-related tasks, with gains from 2.04% to 11.30%. Current agents excel on structured data and routine analysis but struggle with unstructured domains like Computer Vision and Natural Language Processing, and complex tasks such as model training and optimization.

Key takeaway

Machine Learning Engineers evaluating data science agents should prioritize benchmarks that simulate real-world, iterative, and multimodal workflows, as DSAEval does. Consider Claude-Sonnet-4.5 for top overall performance, GPT-5.2 for token efficiency, or MiMo-V2-Flash for cost-effectiveness, but be aware that all current agents are weaker in unstructured data domains. Integrating multimodal perception into agent designs improved performance on vision-related tasks by 2.04% to 11.30% in the evaluation. Focus development effort on agent capabilities for model training, optimization, and unstructured data, where current agents struggle most.

Key insights

Multimodal, multi-query benchmarks are crucial for evaluating LLM-based data science agents on real-world, open-ended problems.

Principles

Method

DSAEval uses a sandbox environment with a Jupyter kernel and GPUs, enabling multimodal observations across text, tabular, and image outputs ($o_{t}^{\text{txt}}, o_{t}^{\text{tab}}, o_{t}^{\text{img}}$) and multi-query interactions. An LLM-based judge scores reasoning, code, and results.
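
The two ideas in this method description, a per-step observation split by modality and a judge score over three dimensions, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the class and field names, the `visible_modalities` helper, the equal weighting of the three dimensions, and the 0 to 10 scale.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """One step's observation o_t, split by modality (field names are hypothetical)."""
    txt: str = ""                                   # o_t^txt: kernel text output
    tab: Optional[str] = None                       # o_t^tab: rendered table preview
    img: list[bytes] = field(default_factory=list)  # o_t^img: raw figure bytes


@dataclass
class JudgeScores:
    """Per-dimension scores from an LLM judge (a 0-10 scale is assumed here)."""
    reasoning: float
    code: float
    result: float


def visible_modalities(obs: Observation, multimodal: bool) -> list[str]:
    """List the modalities an agent would see; text-only agents never see images."""
    names = []
    if obs.txt:
        names.append("txt")
    if obs.tab is not None:
        names.append("tab")
    if multimodal and obs.img:
        names.append("img")
    return names


def overall_score(s: JudgeScores, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted mean of the three dimensions (equal weights are an assumption)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w_reasoning, w_code, w_result = weights
    return w_reasoning * s.reasoning + w_code * s.code + w_result * s.result


if __name__ == "__main__":
    obs = Observation(txt="fit complete", img=[b"\x89PNG"])
    print(visible_modalities(obs, multimodal=False))
    print(visible_modalities(obs, multimodal=True))
    print(round(overall_score(JudgeScores(reasoning=9, code=6, result=9)), 3))
```

Toggling the `multimodal` flag mirrors the kind of comparison behind the reported gains on vision-related tasks: the same sandbox output is produced either way, and only what the agent is allowed to perceive changes.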

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.