ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

ChartDiff is a new large-scale benchmark for evaluating vision-language models (VLMs) on cross-chart comparative summarization, a task in which a model describes the differences between a pair of charts. It comprises 8,541 chart pairs drawn from diverse data sources, covering several chart types (line, bar, multi-series, and pie charts) and visual styles rendered with Matplotlib, Plotly, and Plotnine. Each pair is annotated with an LLM-generated, human-verified summary of the differences in trends, fluctuations, and anomalies. Evaluations on ChartDiff show that frontier general-purpose models such as GPT-5.4 achieve the highest GPT-based quality scores (4.95), while specialized and pipeline-based methods obtain higher ROUGE scores but lower GPT-based quality scores, indicating a mismatch between lexical overlap and actual summary quality. Multi-series charts remain particularly challenging across all model families, though strong end-to-end models are robust to changes in plotting library.

Key takeaway

Research scientists developing or evaluating vision-language models for data analysis should prioritize benchmarks that assess comparative reasoning across multiple charts, such as ChartDiff. Evaluation should not rely on lexical overlap alone (e.g., ROUGE); it should also include quality scores intended to track human judgment (e.g., GPT Score), because the two can rank systems differently. Development effort is best directed at complex chart types, particularly multi-series visualizations, which present the greatest challenge for current models.
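
The gap between lexical overlap and summary quality can be illustrated with a small, simplified sketch. The function below computes a unigram-overlap F1 in the spirit of ROUGE-1; it is not the official ROUGE implementation (no stemming or stopword handling), and the three example sentences are invented for illustration rather than taken from ChartDiff.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary for a chart pair.
reference = "Chart A rises steadily while chart B falls after 2020"

# Factually wrong: the two trends are swapped, but the words are identical.
swapped = "Chart A falls steadily while chart B rises after 2020"

# Factually right, but phrased with different vocabulary.
paraphrase = (
    "The first series climbs throughout, "
    "whereas the second declines from 2020 onward"
)

print(f"swapped:    {rouge1_f1(swapped, reference):.2f}")
print(f"paraphrase: {rouge1_f1(paraphrase, reference):.2f}")
```

Under these assumptions the swapped summary gets a perfect overlap score although it reverses both trends, while the accurate paraphrase scores close to zero. A judge that checks content, whether human or LLM-based, would be expected to rank them the other way around.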

Key insights

Comparative chart reasoning remains a significant challenge for current vision-language models, despite advances in single-chart understanding.

Method

The ChartDiff benchmark uses a multi-stage annotation pipeline: LLM-generated candidate summaries are judged by a second LLM, then manually verified for factual correctness, completeness, and clarity, using underlying CSV data as the source of truth.
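
The control flow of such a pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the names `annotate_pair`, `generate`, `judge`, and `human_check` are hypothetical, and the choice to keep only the judge's top-ranked candidate is an assumption, since the summary above does not say how the judge's output is used.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Annotation:
    pair_id: str
    summary: str
    judge_score: float


def annotate_pair(
    pair_id: str,
    csv_a: str,
    csv_b: str,
    generate: Callable[[str, str], list[str]],
    judge: Callable[[str, str, str], float],
    human_check: Callable[[str, str, str], bool],
) -> Optional[Annotation]:
    """Run one chart pair through generate -> judge -> human verification.

    The CSV strings stand in for the underlying data, which serves as the
    source of truth at every stage.
    """
    # Stage 1: a first LLM proposes candidate summaries.
    candidates = generate(csv_a, csv_b)
    if not candidates:
        return None

    # Stage 2: a second LLM scores each candidate (assumed: keep the best).
    scored = [(judge(c, csv_a, csv_b), c) for c in candidates]
    best_score, best = max(scored, key=lambda item: item[0])

    # Stage 3: a human checks correctness, completeness, and clarity.
    if not human_check(best, csv_a, csv_b):
        return None
    return Annotation(pair_id, best, best_score)


# Stand-in callables so the sketch runs without any model or annotator.
def fake_generate(csv_a: str, csv_b: str) -> list[str]:
    return ["Both series rise.", "Series A rises while series B falls."]


def fake_judge(summary: str, csv_a: str, csv_b: str) -> float:
    return 1.0 if "falls" in summary else 0.0


def fake_human_check(summary: str, csv_a: str, csv_b: str) -> bool:
    return True


result = annotate_pair(
    "pair-0001",
    "year,value\n2019,1\n2020,2\n",
    "year,value\n2019,2\n2020,1\n",
    fake_generate,
    fake_judge,
    fake_human_check,
)
print(result)
```

Passing the three stages in as callables keeps the sketch independent of any particular model API, and returning `None` on a failed human check mirrors the idea that only verified summaries enter the benchmark.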

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.