UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

UniEditBench is a unified benchmark for evaluating image and video editing models, intended to address the fragmentation of existing evaluation across methods and modalities. It covers both reconstruction-based and instruction-driven editing paradigms under a shared protocol, with a taxonomy of nine image operations and eight video operations that includes challenging compositional tasks such as counting and spatial reordering. To make evaluation scalable and cost-effective, UniEditBench distills a high-capacity MLLM judge, Qwen3-VL-235B-A22B Instruct, into lightweight 4B and 8B evaluators. The distilled evaluators score each edit along several dimensions: structural fidelity, text alignment, background consistency, naturalness, and, for videos, temporal-spatial consistency. Experiments show that the distilled evaluators maintain strong agreement with human judgments while substantially reducing computational and financial cost relative to the teacher model.

Key takeaway

For research scientists developing or evaluating visual editing models, UniEditBench offers a standardized, cost-effective alternative to fragmented, per-task evaluation. Consider adopting its multi-dimensional metrics and distilled MLLM evaluators to obtain more consistent, interpretable, and scalable assessments of image and video editing performance, especially on complex compositional tasks. Replacing the teacher model with a distilled evaluator can significantly reduce the computational overhead of high-fidelity evaluation.

Key insights

UniEditBench unifies image and video editing evaluation using cost-effective, distilled MLLM judges aligned with human preferences.

Method

A high-capacity MLLM teacher (Qwen3-VL-235B-A22B) is distilled into 4B/8B student evaluators using a two-stage LoRA fine-tuning curriculum (spatial then temporal) to preserve reasoning capabilities.
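
Two ideas in this description can be sketched without any training framework: why LoRA keeps the number of trainable parameters small, and what a staged curriculum looks like as control flow. The code below is a minimal illustration, not the paper's training code. The layer dimensions and rank are hypothetical, mapping the "spatial" stage to image data and the "temporal" stage to video data is an assumption, and so is carrying the adapter state from the first stage into the second. `train_stage` is a caller-supplied placeholder for whatever fine-tuning routine is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, TypeVar

State = TypeVar("State")


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count parameters in the low-rank factors A (rank x d_in) and B (d_out x rank).

    LoRA freezes the original d_out x d_in weight and trains only A and B.
    """
    return rank * (d_in + d_out)


@dataclass(frozen=True)
class Stage:
    name: str      # stage label taken from the summary: "spatial" or "temporal"
    modality: str  # assumed data modality for the stage (hypothetical mapping)


# Stage order follows the summary: spatial first, then temporal.
CURRICULUM = [Stage("spatial", "image"), Stage("temporal", "video")]


def run_curriculum(
    stages: Sequence[Stage],
    train_stage: Callable[[Stage, Optional[State]], State],
    initial: Optional[State] = None,
) -> Optional[State]:
    """Run the stages in order, passing each stage's result into the next one."""
    state = initial
    for stage in stages:
        state = train_stage(stage, state)
    return state


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with rank-16 adapters.
    full = 4096 * 4096
    lora = lora_trainable_params(4096, 4096, 16)
    print(f"trainable fraction: {lora / full:.4%}")

    # Stand-in trainer that only records the order in which stages run.
    order = run_curriculum(CURRICULUM, lambda s, st: (st or []) + [s.name])
    print(order)
```

The ordering is the point of a curriculum: the second stage starts from whatever the first stage produced, so under this sketch's assumptions the temporal stage builds on the spatial one rather than training from the base student.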

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.