VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

2025-11-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

VDE Bench is a new, human-annotated benchmark designed to evaluate image editing models' capabilities in modifying multilingual and complex visual documents. It addresses limitations of existing approaches like AnyText and GlyphControl, which primarily focus on English and sparse textual layouts, by including densely textual documents in both English and Chinese, such as academic papers, posters, and newspapers. The benchmark features a high-quality dataset of 674 instruction-modified images and introduces a decoupled evaluation framework that quantifies editing performance at the OCR parsing level for fine-grained accuracy assessment. Initial evaluations of representative image editing models using VDE Bench reveal varying performance, with Qwen-Image-Edit showing strong local editing but poor layout preservation, and all models exhibiting noticeable shortcomings in Chinese text editing compared to English.

Key takeaway

If you are a machine learning engineer developing or deploying image editing solutions for visual documents, recognize that current models, even leading ones, show significant limitations on multilingual content, especially Chinese, and on complex text layouts. Your evaluation should move beyond English-centric, sparse-text benchmarks. Prioritize models that perform robustly on dense, multilingual documents, and be aware that text addition remains a major challenge requiring dedicated research or specialized solutions.

Key insights

Image editing models require specialized benchmarks to accurately assess their performance on multilingual and densely textual visual documents.

Principles

Current image editing benchmarks largely neglect dense text and non-Latin scripts.
Decoupled evaluation via OCR parsing offers fine-grained performance diagnostics.
Text addition is a critical bottleneck, posing greater challenges than deletion or replacement.

Method

VDE Bench is constructed by generating text modification instructions (add, delete, replace) for English and Chinese documents, producing edited images with Nano Banana Pro, and conducting rigorous human review and OCR-based validation using PaddleOCR-VL.
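
To make the three instruction types concrete, here is a minimal Python sketch of how an add, delete, or replace instruction could be applied to a document's text blocks to derive the text a perfect edit should produce. The `EditInstruction` record, its field names, and the block-list representation are hypothetical illustrations, not the benchmark's actual schema or pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EditInstruction:
    """Hypothetical record for one text-modification instruction."""
    op: str                          # "add", "delete", or "replace"
    block_index: int                 # index of the text block the edit targets
    old_text: Optional[str] = None   # text to remove (delete / replace)
    new_text: Optional[str] = None   # text to insert (add / replace)


def expected_blocks(blocks: List[str], ins: EditInstruction) -> List[str]:
    """Return the text blocks a perfect edit should yield (assumed representation)."""
    out = list(blocks)  # copy so the source blocks stay untouched
    if ins.op == "add":
        # Insert a new block at the requested position.
        out.insert(ins.block_index, ins.new_text or "")
    elif ins.op == "delete":
        # Remove the first occurrence of old_text from the target block.
        out[ins.block_index] = out[ins.block_index].replace(ins.old_text or "", "", 1).strip()
    elif ins.op == "replace":
        # Swap the first occurrence of old_text for new_text.
        out[ins.block_index] = out[ins.block_index].replace(
            ins.old_text or "", ins.new_text or "", 1
        )
    else:
        raise ValueError(f"unknown op: {ins.op}")
    return out


source = ["年度报告", "Revenue grew 10%"]
target = expected_blocks(source, EditInstruction("replace", 1, "10%", "12%"))
```

The derived `target` list can then serve as the reference that OCR output from an edited image is compared against.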

In practice

When evaluating image editing models, include multilingual and dense document scenarios.
Utilize OCR parsing for precise, localized assessment of text modification quality.
Test text addition instructions explicitly, since they show a larger performance gap than deletion or replacement.
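
The practices above can be sketched as a decoupled score computed on OCR text: one number for the edited block and a separate number for everything that should have stayed unchanged. This is an illustrative sketch only. The function names, the block-aligned input, and the use of `difflib` character similarity are assumptions, not the metrics defined by VDE Bench, and the OCR step itself is left out.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; applies to English and Chinese strings alike."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def decoupled_scores(expected: List[str], observed: List[str], target_index: int) -> Dict[str, float]:
    """Score the edited block and the untouched blocks separately (assumed alignment by index)."""
    def get(blocks: List[str], i: int) -> str:
        # A missing block counts as empty text.
        return blocks[i] if i < len(blocks) else ""

    edit_accuracy = similarity(get(expected, target_index), get(observed, target_index))

    others = [i for i in range(len(expected)) if i != target_index]
    if others:
        preservation = sum(similarity(expected[i], get(observed, i)) for i in others) / len(others)
    else:
        preservation = 1.0

    return {"edit_accuracy": edit_accuracy, "preservation": preservation}


expected = ["Annual Report", "Revenue grew 12%"]
observed = ["Annual Report", "Revenue grew 10%"]  # hypothetical OCR of a failed edit
scores = decoupled_scores(expected, observed, target_index=1)
```

Keeping the two numbers apart is what lets a model that edits the target text well but disturbs the rest of the page be told apart from one that does the reverse.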

Topics

VDE Bench
Image Editing Models
Visual Documents
Multilingual Text
OCR Evaluation
Benchmark Datasets
Text Modification

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.