VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
Summary
VDE Bench is a new, human-annotated benchmark designed to evaluate image editing models' capabilities in modifying multilingual and complex visual documents. It addresses limitations of existing approaches like AnyText and GlyphControl, which primarily focus on English and sparse textual layouts, by including densely textual documents in both English and Chinese, such as academic papers, posters, and newspapers. The benchmark features a high-quality dataset of 674 instruction-modified images and introduces a decoupled evaluation framework that quantifies editing performance at the OCR parsing level for fine-grained accuracy assessment. Initial evaluations of representative image editing models using VDE Bench reveal varying performance, with Qwen-Image-Edit showing strong local editing but poor layout preservation, and all models exhibiting noticeable shortcomings in Chinese text editing compared to English.
Key takeaway
Machine learning engineers developing or deploying image editing solutions for visual documents should recognize that current models, even leading ones, have significant limitations with multilingual content, especially Chinese, and with complex text layouts. Evaluate beyond English-centric, sparse-text benchmarks, prioritize models that perform robustly on dense, multilingual documents, and note that text addition remains a major challenge that may require dedicated research or specialized solutions.
Key insights
Image editing models require specialized benchmarks to accurately assess their performance on multilingual and densely textual visual documents.
Principles
- Current image editing benchmarks largely neglect dense text and non-Latin scripts.
- Decoupled evaluation via OCR parsing offers fine-grained performance diagnostics.
- Text addition is a critical bottleneck, posing greater challenges than deletion or replacement.
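One way to surface the gaps these principles describe (Chinese versus English, addition versus deletion or replacement) is to report scores per language and per operation rather than as a single average. The sketch below is only an illustration: the `breakdown` function and the `(language, operation, score)` record shape are assumptions made here, not part of VDE Bench.

```python
from collections import defaultdict
from statistics import mean


def breakdown(records):
    """Average per-sample scores by (language, operation).

    `records` is an iterable of (language, operation, score) tuples,
    e.g. ("zh", "add", 0.4). This record shape is a hypothetical
    convention for this sketch, not a format defined by the benchmark.
    """
    groups = defaultdict(list)
    for language, operation, score in records:
        groups[(language, operation)].append(score)
    # One mean per group, so a weak (language, operation) pair is not
    # hidden behind a strong overall average.
    return {key: mean(values) for key, values in groups.items()}
```

Reading the result as a small language-by-operation table makes it easier to see whether a model's weakness is tied to a script, an edit type, or both.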
Method
VDE Bench is constructed by generating text modification instructions (add, delete, replace) for English and Chinese documents, producing edited images with Nano Banana Pro, and conducting rigorous human review and OCR-based validation using PaddleOCR-VL.
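To make the three instruction types concrete, the following minimal sketch represents an instruction and derives the text an OCR pass would be expected to find after a correct edit. The `EditInstruction` schema, its field names, and the `expected_blocks` helper are hypothetical and chosen for illustration; the benchmark's actual annotation format is not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass(frozen=True)
class EditInstruction:
    """Hypothetical schema for one text edit on a document image."""
    op: Literal["add", "delete", "replace"]
    block_id: int                   # index into the list of OCR text blocks
    old_text: Optional[str] = None  # used by "replace"
    new_text: Optional[str] = None  # used by "add" and "replace"


def expected_blocks(blocks: List[str], ins: EditInstruction) -> List[str]:
    """Return the text blocks a correct edit should produce."""
    out = list(blocks)  # copy, so the source annotation is left untouched
    if ins.op == "add":
        if ins.new_text is None:
            raise ValueError("add requires new_text")
        out.insert(ins.block_id, ins.new_text)
    elif ins.op == "delete":
        del out[ins.block_id]
    elif ins.op == "replace":
        if ins.old_text is None or ins.new_text is None:
            raise ValueError("replace requires old_text and new_text")
        if ins.old_text not in out[ins.block_id]:
            raise ValueError("old_text not found in the target block")
        # Replace only the first occurrence inside the target block.
        out[ins.block_id] = out[ins.block_id].replace(ins.old_text, ins.new_text, 1)
    else:
        raise ValueError(f"unknown op: {ins.op}")
    return out
```

The expected blocks can then be compared against the OCR output of an edited image to check that the edit landed where the instruction asked.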
In practice
- When evaluating image editing models, include multilingual and dense document scenarios.
- Utilize OCR parsing for precise, localized assessment of text modification quality.
- Test text addition instructions separately, since this is where models show the largest performance gap.
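The OCR-based practice above can be sketched as a decoupled score: one number for the blocks the instruction targets, and a separate one for the blocks that should have stayed unchanged. This is a simplified stand-in, not the paper's metric. It assumes OCR output has already been split into text blocks aligned by index with the expected blocks, and it uses character-level Levenshtein similarity, which works for Chinese as well as English because it needs no word tokenization. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def decoupled_scores(expected, observed, edited_ids):
    """Return (edit_score, preservation_score).

    Assumes `expected` and `observed` are lists of block strings aligned
    by index; a block missing on either side is compared to "".
    An empty group scores 1.0 (nothing to get wrong).
    """
    edit, keep = [], []
    for i in range(max(len(expected), len(observed))):
        e = expected[i] if i < len(expected) else ""
        o = observed[i] if i < len(observed) else ""
        (edit if i in edited_ids else keep).append(similarity(e, o))

    def avg(xs):
        return sum(xs) / len(xs) if xs else 1.0

    return avg(edit), avg(keep)
```

Keeping the two scores apart means a model that edits the target text well but disturbs the rest of the page is not rewarded for it, and the reverse.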
Topics
- VDE Bench
- Image Editing Models
- Visual Documents
- Multilingual Text
- OCR Evaluation
- Benchmark Datasets
- Text Modification
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.