DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
Summary
DSAEval is a new benchmark designed to evaluate Large Language Model (LLM)-based data science agents on 641 real-world problems across 285 diverse datasets, including structured and unstructured data. It features Multimodal Environment Perception, Multi-Query Interactions, and Multi-Dimensional Evaluation, assessing reasoning, code, and results. A systematic evaluation of 11 advanced agentic LLMs revealed Claude-Sonnet-4.5 achieved the strongest overall performance (8.164 score), while GPT-5.2 was the most efficient (approx. 20,000 tokens) and MiMo-V2-Flash the most cost-effective (approx. \$0.007 per task). Multimodal perception consistently improved performance on vision-related tasks, with gains from 2.04% to 11.30%. Current agents excel on structured data and routine analysis but struggle with unstructured domains like Computer Vision and Natural Language Processing, and complex tasks such as model training and optimization.
Key takeaway
If you are a Machine Learning Engineer evaluating data science agents, prioritize benchmarks that simulate real-world, iterative, multimodal workflows, such as DSAEval. Consider Claude-Sonnet-4.5 for top performance, GPT-5.2 for token efficiency, or MiMo-V2-Flash for cost-effectiveness, but be aware that all of them remain limited in unstructured data domains. Integrate multimodal perception into your agent designs to improve performance on vision-related tasks (reported gains range from 2.04% to 11.30%). Focus your development efforts on complex deep learning and unstructured data challenges, where current agents are weakest.
Key insights
Multimodal, multi-query benchmarks are crucial for evaluating LLM-based data science agents on real-world, open-ended problems.
Principles
- Multimodal perception boosts agent performance on vision-related tasks.
- Efficiency and cost vary significantly across LLMs.
- Agents struggle with unstructured data tasks.
Method
DSAEval uses a sandbox environment with a Jupyter kernel and GPUs, enabling multimodal observations ($o_{t}^{\text{txt}}, o_{t}^{\text{tab}}, o_{t}^{\text{img}}$) and multi-query interactions. An LLM-based judge scores each solution on three dimensions: reasoning, code, and results.
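To make the notation concrete, here is a minimal Python sketch of how a per-step observation and a three-dimension judge score might be represented. It is an illustration only, not DSAEval's implementation: the class names `Observation` and `JudgeScores`, the function `overall_score`, the 0-10 scale, and the equal default weights are all assumptions, since this summary does not say how the benchmark aggregates the three dimensions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """Hypothetical container for one step's observation o_t, split by modality."""
    txt: str = ""                                   # o_t^txt: stdout, logs, tracebacks
    tab: Optional[str] = None                       # o_t^tab: a rendered table preview
    img: List[bytes] = field(default_factory=list)  # o_t^img: raw bytes of figures

    def modalities(self) -> List[str]:
        """Return the names of the modalities that actually carry content."""
        present = []
        if self.txt:
            present.append("txt")
        if self.tab is not None:
            present.append("tab")
        if self.img:
            present.append("img")
        return present


@dataclass
class JudgeScores:
    """Per-dimension scores from the judge (0-10 scale is an assumption)."""
    reasoning: float
    code: float
    results: float


def overall_score(s: JudgeScores, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three dimensions; equal weights are an assumption."""
    values = (s.reasoning, s.code, s.results)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * v for w, v in zip(weights, values)) / total_weight


# Placeholder inputs, not benchmark data.
obs = Observation(txt="ok", img=[b"\x89PNG"])
print(obs.modalities())                           # ['txt', 'img']
print(overall_score(JudgeScores(9.0, 6.0, 6.0)))  # 7.0
```

Keeping the modalities as separate fields lets a text-only agent simply ignore `img`, which is one way to compare runs with and without multimodal perception.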
In practice
- Prioritize multimodal LLMs for vision tasks.
- Select agents based on performance, cost, or efficiency.
- Focus development on unstructured data challenges.
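The selection advice above can be sketched as a small lookup. This is a hedged illustration that encodes only the three leaders named in the summary; the table `REPORTED_LEADERS` and the helper `pick_agent` are hypothetical names, and figures for the other evaluated agents are not given here, so they are deliberately left out.

```python
# Leaders as reported in the summary; values for other agents are not known here.
REPORTED_LEADERS = {
    "performance": ("Claude-Sonnet-4.5", "overall score 8.164"),
    "efficiency": ("GPT-5.2", "approx. 20,000 tokens"),
    "cost": ("MiMo-V2-Flash", "approx. $0.007 per task"),
}


def pick_agent(priority: str) -> str:
    """Return the reported leader for a priority: performance, efficiency, or cost."""
    key = priority.strip().lower()
    if key not in REPORTED_LEADERS:
        options = ", ".join(sorted(REPORTED_LEADERS))
        raise ValueError(f"unknown priority {priority!r}; choose one of: {options}")
    return REPORTED_LEADERS[key][0]


for priority, (model, evidence) in REPORTED_LEADERS.items():
    print(f"{priority}: {model} ({evidence})")
```

In practice the choice is rarely single-criterion, so a lookup like this is best treated as a starting point before measuring candidates on your own workload.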
Topics
- Data Science Agents
- LLM Evaluation
- Multimodal AI
- Benchmark Datasets
- Claude-Sonnet-4.5
- GPT-5.2
- MiMo-V2-Flash
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.