Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

Source: Machine Learning · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A de-biased VLM-as-3D-judge protocol has been developed to rank single-image-to-3D mesh quality reliably, addressing the limitations of cheaper geometry and CLIP proxies. This paper extends the judge from ranking to optimization: it is used to specialize a strong open generator, TRELLIS, for specific asset classes such as furniture, without human labels. The work introduces an "optimization-grade hardening" of the judge. To prevent circularity, the training judge (Qwen2.5-VL-7B) is distinct from the evaluation judge (InternVL3-8B). The hardening also adds position-bias correction and fixes three failure modes: image overload, geometry-hiding splat renders, and reference-free judging. Calibration evidence shows win-rates of 0.83-1.0 on clear-gap pairs and around 0.5 on base-vs-base pairs. Using this protocol with public models and lightweight parameter-efficient adaptation, the methods reached parity (0.50) with the strong base model under severe degradation but did not exceed it, so they missed the >=65% win-rate target. The study reports three findings: clean inputs saturate the judge, flow-DIT fine-tuning washes out, and conditioning repair is what moves the geometry.

Key takeaway

Machine Learning Engineers developing single-image 3D generation models should consider a de-biased VLM-as-3D-judge protocol for automated quality assessment and optimization. Exceeding a strong public-data baseline will likely require more than lightweight PEFT on public datasets. Focus adaptation on conditioner repair, since that is where geometry changes are driven most effectively. Also construct quality-contrastive data so that the judge's preferences give a learnable signal.

Key insights

A hardened VLM-as-3D-judge protocol can specialize 3D generators, but exceeding strong baselines requires more than lightweight PEFT.

Method

The protocol uses distinct VLM judges for training and evaluation (e.g., Qwen2.5-VL-7B for training, InternVL3-8B for evaluation), applies position-bias correction, and fixes three failure modes: image overload, geometry-hiding splat renders, and reference-free judging.
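
The summary does not say how position bias is corrected. One common approach, assumed here purely for illustration, is to query the judge twice with the two candidates in swapped order and to count a win only when both orderings agree. The sketch below shows that idea together with the win-rate computation and the >=65% target. The judge callable, its "first"/"second" return values, and all function names are hypothetical, not the paper's API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes (reference_image, first_candidate, second_candidate) and
# returns "first" or "second" for the candidate it prefers.
Judge = Callable[[str, str, str], str]

TARGET_WIN_RATE = 0.65  # the >=65% target quoted in the summary


def debiased_vote(judge: Judge, reference: str, a: str, b: str) -> float:
    """Score candidate `a` against `b` with both presentation orders.

    Returns 1.0 if `a` wins in both orders, 0.0 if it loses in both,
    and 0.5 (a tie) if the verdict flips with the order.
    """
    a_wins_forward = judge(reference, a, b) == "first"    # a shown first
    a_wins_backward = judge(reference, b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return 1.0
    if not a_wins_forward and not a_wins_backward:
        return 0.0
    return 0.5


def win_rate(judge: Judge, triples: Sequence[Tuple[str, str, str]]) -> float:
    """Mean de-biased score of candidate `a` over (reference, a, b) triples."""
    scores = [debiased_vote(judge, ref, a, b) for ref, a, b in triples]
    return sum(scores) / len(scores)


def always_first(reference: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always prefers the first slot."""
    return "first"


triples = [("img0", "tuned0", "base0"), ("img1", "tuned1", "base1")]
rate = win_rate(always_first, triples)
print(rate)                     # 0.5
print(rate >= TARGET_WIN_RATE)  # False
```

A purely position-biased judge lands at exactly 0.5 under this scheme, the same value the calibration reports for base-vs-base comparisons, so position preference alone cannot push a candidate past the target.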

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.