ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
Summary
ChartDiff is a large-scale benchmark for evaluating vision-language models (VLMs) on cross-chart comparative summarization, a task in which a model describes the differences between two charts. It comprises 8,541 chart pairs drawn from diverse data sources and covering several chart types (line, bar, multi-series, and pie) and visual styles, rendered with Matplotlib, Plotly, and Plotnine. Each pair is annotated with an LLM-generated, human-verified summary of the differences in trends, fluctuations, and anomalies. In evaluations on ChartDiff, frontier general-purpose models such as GPT-5.4 achieve the highest GPT-based quality scores (4.95), while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned scores, which indicates a mismatch between lexical overlap and actual summary quality. Multi-series charts remain particularly challenging across all model families, though strong end-to-end models are robust to the choice of plotting library.
Key takeaway
If you develop or evaluate vision-language models for data analysis, prioritize benchmarks that assess comparative reasoning across multiple charts, such as ChartDiff. Evaluation should extend beyond lexical overlap (e.g., ROUGE) to include human-aligned quality scores (e.g., GPT Score), so that reported numbers reflect actual summary quality. Focus on improving VLM capabilities for complex chart types, particularly multi-series visualizations, which pose the greatest challenge for current models.
Key insights
Comparative chart reasoning remains a significant challenge for current vision-language models, despite advances in single-chart understanding.
Principles
- Lexical overlap metrics do not reliably indicate human-aligned summary quality.
- Chart complexity, especially multi-series data, significantly impacts VLM performance.
- End-to-end VLMs are more robust to plotting library variations than pipeline methods.
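The first principle can be made concrete with a small sketch. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (no stemming, no stopword handling); it is not the scorer used in the ChartDiff evaluation, and the three sentences are invented for illustration. It shows how a summary that swaps the two trends can score perfectly on lexical overlap, while a correct paraphrase scores low.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (simplified ROUGE-1, whitespace tokenization)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection: each shared token counts min(ref, cand) times.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumptions, not taken from the benchmark).
reference = "sales in chart a rise steadily while sales in chart b fall sharply"
# Factually wrong: the trends of the two charts are swapped.
wrong = "sales in chart a fall sharply while sales in chart b rise steadily"
# Factually right, but worded differently from the reference.
paraphrase = "the first chart shows steady growth whereas the second shows a steep decline"

score_wrong = rouge1_f1(reference, wrong)
score_paraphrase = rouge1_f1(reference, paraphrase)
print(f"swapped trends: {score_wrong:.3f}")
print(f"correct paraphrase: {score_paraphrase:.3f}")
```

Because the swapped summary reuses exactly the same words as the reference, unigram overlap cannot distinguish it from a correct one, which is why a judge that reads for meaning is a useful complement.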
Method
The ChartDiff benchmark uses a multi-stage annotation pipeline: LLM-generated candidate summaries are judged by a second LLM, then manually verified for factual correctness, completeness, and clarity, using underlying CSV data as the source of truth.
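The three stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ChartPair`, `Annotation`, `annotate`, and the three callables (`generate`, `judge`, `human_check`) are hypothetical names, and selecting the single highest-scoring candidate is an assumption about how the judge's output is used.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ChartPair:
    pair_id: str
    csv_a: str  # underlying data of the first chart (source of truth)
    csv_b: str  # underlying data of the second chart


@dataclass
class Annotation:
    pair_id: str
    summary: str
    verified: bool


def annotate(
    pair: ChartPair,
    generate: Callable[[ChartPair], Sequence[str]],   # stage 1: LLM drafts candidates
    judge: Callable[[ChartPair, str], float],         # stage 2: second LLM scores each one
    human_check: Callable[[ChartPair, str], bool],    # stage 3: manual verification
) -> Optional[Annotation]:
    """Return a verified annotation, or None if no candidate survives."""
    candidates = generate(pair)
    if not candidates:
        return None
    # Assumption: keep only the candidate the judge scores highest.
    best = max(candidates, key=lambda summary: judge(pair, summary))
    # The reviewer checks correctness, completeness, and clarity against the CSVs.
    if not human_check(pair, best):
        return None
    return Annotation(pair_id=pair.pair_id, summary=best, verified=True)
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in a real pipeline each would wrap a model call or a review interface.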
In practice
- Prioritize GPT Score over ROUGE for evaluating comparative chart summarization.
- Focus VLM development on improving multi-series chart understanding.
- Consider end-to-end VLMs for robustness across diverse chart rendering styles.
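One way to act on these recommendations is to report scores per chart type rather than as a single average, so that weaknesses on multi-series charts are not hidden. The sketch below assumes a hypothetical record format (`chart_type`, `rouge`, `judge_score`), and the numbers are toy placeholders, not ChartDiff results.

```python
from collections import defaultdict
from statistics import mean


def mean_by_chart_type(records, metric):
    """Average one metric per chart type."""
    groups = defaultdict(list)
    for record in records:
        groups[record["chart_type"]].append(record[metric])
    return {chart_type: mean(values) for chart_type, values in groups.items()}


# Toy placeholder records (hypothetical field names and values).
records = [
    {"chart_type": "line", "rouge": 0.25, "judge_score": 4.0},
    {"chart_type": "line", "rouge": 0.75, "judge_score": 5.0},
    {"chart_type": "multi-series", "rouge": 0.5, "judge_score": 3.0},
]

judge_by_type = mean_by_chart_type(records, "judge_score")
rouge_by_type = mean_by_chart_type(records, "rouge")
# Chart type with the lowest judge-based score in this toy data.
weakest = min(judge_by_type, key=judge_by_type.get)
print(judge_by_type, rouge_by_type, weakest)
```

Reporting both metrics side by side per chart type also makes it easier to spot cases where ROUGE and the judge-based score disagree.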
Topics
- ChartDiff Benchmark
- Cross-chart Summarization
- Vision-Language Models
- Comparative Reasoning
- Multi-series Charts
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.