ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

2026-03-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ChartDiff is a new large-scale benchmark for evaluating vision-language models (VLMs) on cross-chart comparative summarization, a task in which a model describes the differences between a pair of charts. It comprises 8,541 chart pairs drawn from diverse data sources, covering multiple chart types (including line, bar, multi-series, and pie charts) and visual styles rendered with Matplotlib, Plotly, and Plotnine. Each pair is annotated with an LLM-generated, human-verified summary of differences in trends, fluctuations, and anomalies. Evaluations on ChartDiff show that frontier general-purpose models such as GPT-5.4 achieve the highest GPT-based quality scores (4.95), while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned quality ratings, indicating a mismatch between lexical overlap and actual summary quality. Multi-series charts remain particularly challenging across all model families, though strong end-to-end models are robust to differences in plotting library.

Key takeaway

If you develop or evaluate vision-language models for data analysis, prioritize benchmarks that assess comparative reasoning across multiple charts, such as ChartDiff. Evaluation should go beyond lexical overlap metrics (e.g., ROUGE) and include human-aligned quality scores (e.g., GPT Score), so that reported numbers reflect actual summary quality. Focus development effort on complex chart types, particularly multi-series visualizations, which are the hardest for current models.

Key insights

Comparative chart reasoning remains a significant challenge for current vision-language models, despite advances in single-chart understanding.

Principles

Lexical overlap metrics do not reliably indicate human-aligned summary quality.
Chart complexity, especially multi-series data, significantly impacts VLM performance.
Strong end-to-end VLMs are robust to plotting library variations.
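
To make the first principle concrete, here is a minimal sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is a simplified stand-in written for illustration, not the scorer used in the ChartDiff evaluation; the function name `rouge1_f1` and the example sentences are invented. It shows how a summary that reverses the trends can still get a perfect overlap score, while a faithful paraphrase scores near zero.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens.

    Simplified illustration only: no stemming, no punctuation handling.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (not taken from the benchmark).
reference = "chart a rises steadily while chart b falls after 2020"
# Same words, but the two trends are swapped: factually wrong.
swapped = "chart a falls steadily while chart b rises after 2020"
# Same meaning as the reference, but different wording.
paraphrase = "the first series climbs throughout, the second declines from 2020 onward"

swapped_score = rouge1_f1(swapped, reference)
paraphrase_score = rouge1_f1(paraphrase, reference)
print(f"swapped (wrong):    {swapped_score:.3f}")
print(f"paraphrase (right): {paraphrase_score:.3f}")
```

Because the swapped sentence uses exactly the same words as the reference, its overlap score is maximal even though it describes the opposite comparison. This is the kind of failure a human-aligned quality score is meant to catch.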

Method

The ChartDiff benchmark uses a multi-stage annotation pipeline: LLM-generated candidate summaries are judged by a second LLM, then manually verified for factual correctness, completeness, and clarity, using underlying CSV data as the source of truth.
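
The staged flow described above can be sketched roughly as follows. This is a hypothetical outline, not the authors' code: the names `ChartPair`, `annotate_pair`, `generate`, `judge`, and `verify` are all invented for illustration, and the assumption that the judge ranks candidates (rather than accepting or rejecting each one) is a guess, since the summary does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ChartPair:
    """Hypothetical container: the CSV text behind each chart is the source of truth."""
    pair_id: str
    csv_a: str
    csv_b: str


def annotate_pair(
    pair: ChartPair,
    generate: Callable[[ChartPair], List[str]],  # stage 1: LLM proposes candidate summaries
    judge: Callable[[ChartPair, str], float],    # stage 2: second LLM scores a candidate
    verify: Callable[[ChartPair, str], bool],    # stage 3: human check against the CSV data
) -> Optional[str]:
    """Return a verified summary, or None if nothing survives all three stages."""
    candidates = generate(pair)
    if not candidates:
        return None
    # Assumption: the judge ranks candidates and only the best goes to a human.
    best = max(candidates, key=lambda summary: judge(pair, summary))
    return best if verify(pair, best) else None


# Stub stages so the sketch runs without any model or annotator.
def fake_generate(pair: ChartPair) -> List[str]:
    return ["A rises while B stays flat.", "Both charts rise."]


def fake_judge(pair: ChartPair, summary: str) -> float:
    return 1.0 if "flat" in summary else 0.0


def fake_verify(pair: ChartPair, summary: str) -> bool:
    return "flat" in summary


pair = ChartPair(
    pair_id="demo-001",
    csv_a="year,value\n2020,1\n2021,3\n",
    csv_b="year,value\n2020,2\n2021,2\n",
)
result = annotate_pair(pair, fake_generate, fake_judge, fake_verify)
print(result)
```

Passing the stages in as callables keeps the control flow separate from any particular model API, so each stage can be replaced by a real LLM call or a human review queue.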

In practice

Prioritize GPT Score over ROUGE for evaluating comparative chart summarization.
Focus VLM development on improving multi-series chart understanding.
Consider end-to-end VLMs for robustness across diverse chart rendering styles.
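
One way to act on these points is to break a quality score down by chart type instead of reporting a single average, so that weak categories such as multi-series charts stand out. The sketch below assumes you already have per-example `(chart_type, score)` records. The helper name `mean_score_by_type` is invented, and the numbers are toy placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_type(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (chart_type, score) records and return the mean score per chart type."""
    buckets = defaultdict(list)
    for chart_type, score in records:
        buckets[chart_type].append(score)
    return {chart_type: mean(scores) for chart_type, scores in buckets.items()}


# Toy placeholder values for illustration only, NOT results from ChartDiff.
records = [
    ("line", 4.0),
    ("line", 5.0),
    ("multi-series", 3.0),
    ("multi-series", 2.0),
    ("pie", 4.0),
]

by_type = mean_score_by_type(records)
weakest = min(by_type, key=by_type.get)  # chart type with the lowest mean score
for chart_type, avg in sorted(by_type.items()):
    print(f"{chart_type:>12}: {avg:.2f}")
print("weakest chart type:", weakest)
```

The same grouping works with plotting library as the key, which is one way to check robustness across rendering styles.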

Topics

ChartDiff Benchmark
Cross-chart Summarization
Vision-Language Models
Comparative Reasoning
Multi-series Charts

Code references

has2k1/plotnine

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.