VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

VDE Bench is a new, human-annotated benchmark designed to evaluate image editing models' capabilities in modifying multilingual and complex visual documents. It addresses limitations of existing approaches like AnyText and GlyphControl, which primarily focus on English and sparse textual layouts, by including densely textual documents in both English and Chinese, such as academic papers, posters, and newspapers. The benchmark features a high-quality dataset of 674 instruction-modified images and introduces a decoupled evaluation framework that quantifies editing performance at the OCR parsing level for fine-grained accuracy assessment. Initial evaluations of representative image editing models using VDE Bench reveal varying performance, with Qwen-Image-Edit showing strong local editing but poor layout preservation, and all models exhibiting noticeable shortcomings in Chinese text editing compared to English.

Key takeaway

Machine learning engineers developing or deploying image editing solutions for visual documents should recognize that current models, even leading ones, show significant limitations in handling multilingual content, especially Chinese, and complex text layouts. Evaluation should move beyond English-centric, sparse-text benchmarks. Prioritize models that perform robustly on dense, multilingual documents, and be aware that text addition remains a major challenge that may require dedicated research or specialized solutions.

Key insights

Image editing models require specialized benchmarks to accurately assess their performance on multilingual and densely textual visual documents.

Method

VDE Bench is constructed by generating text modification instructions (add, delete, replace) for English and Chinese documents, producing edited images with Nano Banana Pro, and conducting rigorous human review and OCR-based validation using PaddleOCR-VL.
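The OCR-based validation step can be illustrated with a minimal sketch. It assumes the OCR stage (PaddleOCR-VL in the benchmark) has already produced a list of text lines for the source image and for the edited image; the OCR call itself is not shown. The names `Instruction`, `edit_applied`, and `preservation` are hypothetical, and the two scores are an illustrative stand-in for a decoupled check, not the metrics defined in the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Instruction:
    """Hypothetical container for one text-edit instruction."""
    op: str        # "add", "delete", or "replace"
    old: str = ""  # text expected to disappear (delete, replace)
    new: str = ""  # text expected to appear (add, replace)


def edit_applied(source_lines: List[str], edited_lines: List[str],
                 ins: Instruction) -> bool:
    """Check whether the OCR text of the edited image reflects the instruction."""
    src = "\n".join(source_lines)
    out = "\n".join(edited_lines)
    if ins.op == "add":
        return ins.new in out
    if ins.op == "delete":
        return ins.old in src and ins.old not in out
    if ins.op == "replace":
        return ins.old not in out and ins.new in out
    raise ValueError(f"unknown operation: {ins.op}")


def _drop_lines(lines: List[str], needle: str) -> List[str]:
    """Remove lines containing the needle; an empty needle removes nothing."""
    if not needle:
        return list(lines)
    return [ln for ln in lines if needle not in ln]


def preservation(source_lines: List[str], edited_lines: List[str],
                 ins: Instruction) -> float:
    """Similarity (0..1) of the lines the instruction should not have touched."""
    kept_src = _drop_lines(source_lines, ins.old)
    kept_out = _drop_lines(edited_lines, ins.new)
    return SequenceMatcher(None, kept_src, kept_out).ratio()


source = ["Annual Report 2023", "Revenue grew 5%", "Contact: info@example.com"]
edited = ["Annual Report 2024", "Revenue grew 5%", "Contact: info@example.com"]
ins = Instruction(op="replace", old="2023", new="2024")

ok = edit_applied(source, edited, ins)
score = preservation(source, edited, ins)
```

Keeping the two checks separate means a model that rewrites the target text correctly but damages the rest of the page is flagged on the second score rather than passing on the first.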

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.