UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
Summary
UniEditBench is a new, unified benchmark designed for evaluating both image and video editing models, addressing fragmentation across existing evaluation methods and modalities. It supports reconstruction-based and instruction-driven paradigms under a shared protocol, featuring a taxonomy of nine image operations and eight video operations, including challenging compositional tasks like counting and spatial reordering. To enable scalable and cost-effective evaluation, UniEditBench distills a high-capacity MLLM judge, Qwen3-VL-235B-A22B Instruct, into lightweight 4B/8B evaluators. These distilled models provide multi-dimensional scoring across structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency for videos. Experiments demonstrate that the distilled evaluators maintain strong agreement with human judgments while substantially reducing computational and financial costs compared to the teacher model.
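One common way to quantify agreement between an automatic evaluator and human raters is a rank correlation. The summary does not say which agreement statistic the authors use, so the sketch below assumes Spearman's rank correlation, and the function names are illustrative only.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(evaluator_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(evaluator_scores) != len(human_scores) or len(evaluator_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(evaluator_scores), average_ranks(human_scores))
```

Calling `spearman(evaluator_scores, human_scores)` once per scoring dimension would show where a distilled evaluator tracks human raters closely and where it diverges.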
Key takeaway
Research scientists developing or evaluating visual editing models can use UniEditBench as a standardized, lower-cost alternative to fragmented, modality-specific evaluation. Consider adopting its multi-dimensional metrics and distilled MLLM evaluators to get more consistent, interpretable, and scalable assessments of both image and video editing performance, especially for complex compositional tasks. Running the 4B/8B evaluators in place of the 235B teacher reduces the computational and financial cost of evaluation.
Key insights
UniEditBench unifies image and video editing evaluation using cost-effective, distilled MLLM judges aligned with human preferences.
Principles
- A shared protocol across reconstruction-based and instruction-driven editing makes results comparable.
- Multi-dimensional metrics offer interpretable analysis.
- Distillation enables scalable, cost-effective MLLM evaluation.
Method
A high-capacity MLLM teacher (Qwen3-VL-235B-A22B) is distilled into 4B/8B student evaluators using a two-stage LoRA fine-tuning curriculum (spatial then temporal) to preserve reasoning capabilities.
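The ordering of the two-stage curriculum can be sketched as a simple data schedule. This is a minimal illustration, not the authors' training code: it assumes the "spatial" stage uses teacher-labelled image samples and the "temporal" stage uses teacher-labelled video samples, which the summary does not state explicitly. `Sample` and `build_curriculum` are hypothetical names, and the LoRA fine-tuning step itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    modality: str         # "image" or "video" (assumed labels)
    teacher_scores: dict  # per-dimension scores produced by the teacher judge


def build_curriculum(samples):
    """Return ordered (stage_name, samples) pairs: spatial first, then temporal."""
    stages = {"spatial": [], "temporal": []}
    for sample in samples:
        if sample.modality == "image":
            stages["spatial"].append(sample)
        elif sample.modality == "video":
            stages["temporal"].append(sample)
        else:
            raise ValueError(f"unknown modality: {sample.modality!r}")
    # The student's adapter would be trained on stage 1 and then
    # continue training from that state on stage 2.
    return [("spatial", stages["spatial"]), ("temporal", stages["temporal"])]
```

A training loop would iterate over the returned stages in order, carrying the adapter weights from the first stage into the second.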
In practice
- Use 4B/8B distilled evaluators for cost-effective benchmarking.
- Employ multi-dimensional scoring for detailed failure analysis.
- Standardize prompt interfaces for cross-paradigm comparison.
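As a sketch of how multi-dimensional scores might support failure analysis, the snippet below averages each dimension over a set of evaluated edits and reports the weakest one. The dimension names follow the summary; the identifiers, the dictionary layout, and the numeric score scale are assumptions rather than the benchmark's actual output format.

```python
from statistics import mean

# Dimensions named in the summary; the identifiers themselves are illustrative.
IMAGE_DIMENSIONS = (
    "structural_fidelity",
    "text_alignment",
    "background_consistency",
    "naturalness",
)
VIDEO_DIMENSIONS = IMAGE_DIMENSIONS + ("temporal_spatial_consistency",)


def per_dimension_means(score_records, dimensions):
    """Average each dimension over a list of {dimension: score} records."""
    if not score_records:
        raise ValueError("no score records given")
    return {d: mean(record[d] for record in score_records) for d in dimensions}


def weakest_dimension(means):
    """Return the dimension with the lowest mean score."""
    return min(means, key=means.get)


# Hypothetical scores for two edited images from one model.
records = [
    {"structural_fidelity": 4, "text_alignment": 5,
     "background_consistency": 2, "naturalness": 4},
    {"structural_fidelity": 4, "text_alignment": 3,
     "background_consistency": 2, "naturalness": 4},
]
means = per_dimension_means(records, IMAGE_DIMENSIONS)
print(weakest_dimension(means))
```

Reporting the per-dimension means alongside any overall score keeps a failure such as background drift visible instead of averaging it away.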
Topics
- UniEditBench
- Image and Video Editing
- MLLM-based Evaluation
- Knowledge Distillation
- Multi-dimensional Metrics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.