UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

UniEditBench is a unified benchmark for evaluating both image and video editing models, addressing the fragmentation of existing evaluation methods across modalities. It supports reconstruction-based and instruction-driven paradigms under a shared protocol, with a taxonomy of nine image operations and eight video operations, including challenging compositional tasks such as counting and spatial reordering. To make evaluation scalable and cost-effective, UniEditBench distills a high-capacity MLLM judge, Qwen3-VL-235B-A22B Instruct, into lightweight 4B/8B evaluators. The distilled models score each edit along several dimensions: structural fidelity, text alignment, background consistency, naturalness, and, for videos, temporal-spatial consistency. Experiments show that the distilled evaluators maintain strong agreement with human judgments while substantially reducing computational and financial costs compared to the teacher model.
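The summary does not say how agreement with human judgments is measured. One common choice is Spearman rank correlation between evaluator scores and human ratings. The sketch below assumes that metric; the function names and the toy ratings are illustrative and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the paper.
    evaluator_scores = [4, 2, 5, 3, 1]
    human_ratings = [5, 2, 4, 3, 1]
    print(round(spearman(evaluator_scores, human_ratings), 3))
```

A rank-based measure is a reasonable fit here because a judge that orders edits the same way as humans is useful even if its absolute scores are shifted.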

Key takeaway

For research scientists developing or evaluating visual editing models, UniEditBench offers a standardized, cost-effective alternative to fragmented evaluation. Consider adopting its multi-dimensional metrics and distilled MLLM evaluators to get more consistent, interpretable, and scalable assessments of both image and video editing performance, especially for complex compositional tasks. Running a 4B/8B evaluator in place of the 235B teacher can significantly reduce the computational overhead of MLLM-based evaluation.

Key insights

UniEditBench unifies image and video editing evaluation using cost-effective, distilled MLLM judges aligned with human preferences.

Principles

Evaluating reconstruction-based and instruction-driven editors under one shared protocol makes their results directly comparable.
Multi-dimensional metrics offer interpretable analysis.
Distillation enables scalable, cost-effective MLLM evaluation.

Method

A high-capacity MLLM teacher (Qwen3-VL-235B-A22B) is distilled into 4B/8B student evaluators using a two-stage LoRA fine-tuning curriculum (spatial then temporal) to preserve reasoning capabilities.
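As a reminder of what LoRA fine-tuning changes in the student, the sketch below shows the standard low-rank parameterization: the frozen weight W is used as W + (alpha / r) * B @ A, and only the small matrices A and B are trained. This is generic LoRA in plain Python, not the authors' training code; the function names, shapes, and values are illustrative assumptions, and the two-stage (spatial then temporal) ordering is not modeled.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    Shapes: w is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    frozen_w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]        # rank r = 1, in_features = 2
    b_init = [[0.0], [0.0]]  # B starts at zero, so the adapter is a no-op
    print(lora_effective_weight(frozen_w, a, b_init, alpha=2.0))

    b_trained = [[1.0], [2.0]]  # hypothetical values after some training
    print(lora_effective_weight(frozen_w, a, b_trained, alpha=2.0))
```

Because the base weights stay frozen and only A and B are updated, the adapter has far fewer trainable parameters than the full model, which is one plausible reason this setup suits distillation into 4B/8B students.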

In practice

Use 4B/8B distilled evaluators for cost-effective benchmarking.
Employ multi-dimensional scoring for detailed failure analysis.
Standardize prompt interfaces for cross-paradigm comparison.
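The three practices above can be sketched together: one record type that covers both editing paradigms, one prompt builder for the judge, and a per-dimension aggregation that points at the weakest dimension. Everything here is a hypothetical illustration: the `EditCase` fields, the prompt wording, and the helper names are assumptions, and only the dimension names come from the summary.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names follow the summary; videos add one temporal dimension.
IMAGE_DIMENSIONS = (
    "structural_fidelity",
    "text_alignment",
    "background_consistency",
    "naturalness",
)
VIDEO_DIMENSIONS = IMAGE_DIMENSIONS + ("temporal_spatial_consistency",)


@dataclass(frozen=True)
class EditCase:
    """Hypothetical unified record for one editing task."""
    modality: str        # "image" or "video"
    source_caption: str  # describes the input (used by reconstruction-based editors)
    target_caption: str  # describes the desired output
    instruction: str     # imperative form (used by instruction-driven editors)


def dimensions_for(modality):
    if modality == "image":
        return IMAGE_DIMENSIONS
    if modality == "video":
        return VIDEO_DIMENSIONS
    raise ValueError(f"unknown modality: {modality!r}")


def build_judge_prompt(case):
    """Build one judge prompt, identical in shape for both paradigms."""
    dims = ", ".join(dimensions_for(case.modality))
    return (
        f"Source: {case.source_caption}\n"
        f"Target: {case.target_caption}\n"
        f"Instruction: {case.instruction}\n"
        f"Return one numeric score for each of: {dims}."
    )


def dimension_means(score_rows, dimensions):
    """Average each dimension over a list of per-sample score dicts."""
    return {d: mean(row[d] for row in score_rows) for d in dimensions}


def weakest_dimension(score_rows, dimensions):
    """Name the dimension with the lowest mean score."""
    means = dimension_means(score_rows, dimensions)
    return min(means, key=means.get)


if __name__ == "__main__":
    case = EditCase("video", "a dog on grass", "two dogs on grass", "Add a second dog.")
    print(build_judge_prompt(case))
```

Keeping the source caption, target caption, and instruction in one record means a reconstruction-based editor and an instruction-driven editor can be run on the same case and judged with the same prompt.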

Topics

UniEditBench
Image and Video Editing
MLLM-based Evaluation
Knowledge Distillation
Multi-dimensional Metrics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.