Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

DIRECT (Decomposed Injection for REference Composition and Target-integration) is a framework for pose-controllable object insertion. It targets a gap in current 2D inpainting methods, which lack explicit 3D pose control, by combining interactive 3D pose manipulation with high-fidelity 2D image synthesis. DIRECT decomposes the insertion conditions into appearance, geometry (from a user-adjusted 3D proxy), and context guidance, and injects them through separate pathways to prevent feature entanglement. An automated data construction pipeline curates a hybrid dataset of over 160k pairs from SA-1B and MVImgNet. In experiments, DIRECT, implemented on FLUX.1-Fill-dev, consistently outperforms baselines in geometric controllability and visual quality, with higher PSNR, SSIM, CLIP-I, and DINO scores and lower LPIPS and Matching Error.

Key takeaway

For machine learning engineers building generative media or augmented reality applications, DIRECT offers an approach to pose-controllable object insertion. If your projects require precise 3D spatial alignment and high-fidelity appearance preservation, consider adopting its decomposed guidance strategy. By separating geometry, appearance, and context, it addresses the limitations of 2D inpainting and sparse 3D controls and supports realistic scene integration even under complex pose changes. Be mindful of the ethical implications of potential misuse for creating misleading visual content.

Key insights

DIRECT enables precise 3D-aware object insertion by decomposing appearance, geometry, and context guidance into independent generative pathways.

Principles

Decompose conditioning signals to prevent feature entanglement.
Utilize 3D visual proxies for explicit 6-DoF pose control.
Synthesize training data from single-view images for real-world diversity.
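
To make "explicit 6-DoF pose control" concrete, the sketch below applies a pose (three rotation angles plus a 3D translation) to a point on a proxy and projects it with a pinhole camera. It is a minimal illustration, not DIRECT's renderer: the function names, the ZYX angle convention, and the camera intrinsics (focal, cx, cy) are all assumptions made for this example.

```python
import math


def rotation_zyx(yaw, pitch, roll):
    """Return the 3x3 rotation Rz(yaw) @ Ry(pitch) @ Rx(roll); angles in radians.

    The ZYX convention is an assumption for this sketch.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def apply_pose(point, rotation, translation):
    """Rotate a 3D proxy point, then translate it into camera coordinates."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point_cam, focal, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


if __name__ == "__main__":
    # Hypothetical pose: quarter turn about the optical axis, 4 units away.
    pose_rotation = rotation_zyx(math.pi / 2, 0.0, 0.0)
    pose_translation = (0.0, 0.0, 4.0)
    corner = (0.5, 0.0, 0.0)  # one point on the proxy, in object coordinates
    print(project(apply_pose(corner, pose_rotation, pose_translation),
                  focal=500.0, cx=256.0, cy=256.0))
```

The six degrees of freedom are the three angles and the three translation components. Rendering every visible proxy point this way gives an image-space geometry cue of the kind the digest describes.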

Method

The method lifts a 2D reference into a 3D proxy, renders it for 6-DoF geometry guidance, and injects this alongside appearance and global context via modality-specific LoRA adapters for high-fidelity 2D synthesis.
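
The idea of modality-specific LoRA adapters can be sketched as a shared linear layer plus one low-rank update per conditioning modality. The digest does not say how DIRECT wires or routes its adapters, so the routing by a modality key, the class and function names, and the scaling factor alpha / rank (the common LoRA convention) are assumptions here. The code is plain Python for readability, not a FLUX integration.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


@dataclass
class LoRAAdapter:
    """Low-rank update: delta(x) = (alpha / rank) * up @ (down @ x)."""
    down: Matrix  # shape (rank, d_in)
    up: Matrix    # shape (d_out, rank)
    alpha: float

    def delta(self, x: Vector) -> Vector:
        rank = len(self.down)
        scale = self.alpha / rank
        return [scale * h for h in matvec(self.up, matvec(self.down, x))]


def adapted_linear(base_weight: Matrix,
                   adapters: Dict[str, LoRAAdapter],
                   modality: str,
                   x: Vector) -> Vector:
    """Apply the frozen base layer plus the adapter for this modality, if any."""
    base = matvec(base_weight, x)
    adapter = adapters.get(modality)
    if adapter is None:
        return base  # tokens without an adapter see only the base weights
    return [b + d for b, d in zip(base, adapter.delta(x))]


if __name__ == "__main__":
    base_weight = [[1.0, 0.0], [0.0, 1.0]]
    adapters = {
        # Hypothetical values; a zero "up" matrix is the usual LoRA init.
        "appearance": LoRAAdapter(down=[[1.0, 0.0]], up=[[0.0], [0.0]], alpha=1.0),
        "geometry": LoRAAdapter(down=[[1.0, 0.0]], up=[[0.0], [2.0]], alpha=1.0),
    }
    for name in ("appearance", "geometry", "context"):
        print(name, adapted_linear(base_weight, adapters, name, [3.0, 4.0]))
```

Because each modality owns its own low-rank matrices, gradients from geometry tokens never update the appearance adapter. This is one way to read the digest's claim that separate pathways prevent feature entanglement.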

In practice

Employ RGB-based geometry guidance to resolve symmetric object orientation.
Use shape-decomposed mask augmentation to improve perceptual quality.
Train with progressive resolution for high-quality, diverse object geometries.
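
A toy example shows why an RGB rendering can disambiguate the orientation of a symmetric object where a silhouette cannot. This is one plausible reading of the first tip above, not the paper's own analysis. The 2D square, its corner colors, and the helper names are hypothetical.

```python
def rotate_quarter_turns(points, turns):
    """Rotate (x, y, color) points about the origin by turns * 90 degrees.

    Rotation is counter-clockwise and uses exact integer arithmetic.
    """
    rotated = []
    for x, y, color in points:
        for _ in range(turns % 4):
            x, y = -y, x
        rotated.append((x, y, color))
    return rotated


def silhouette(points):
    """Shape-only signal: the set of occupied positions, colors discarded."""
    return frozenset((x, y) for x, y, _ in points)


def rgb_render(points):
    """Appearance-aware signal: which color lands at each position."""
    return {(x, y): color for x, y, color in points}


# A square with a differently colored marker at each corner.
square = [(1, 1, "red"), (-1, 1, "green"), (-1, -1, "blue"), (1, -1, "yellow")]
rotated = rotate_quarter_turns(square, 1)

same_shape = silhouette(square) == silhouette(rotated)
same_colors = rgb_render(square) == rgb_render(rotated)

if __name__ == "__main__":
    print("silhouettes match:", same_shape)
    print("RGB renders match:", same_colors)
```

The two poses yield the same silhouette but different color layouts, so only the RGB signal tells the model which way the object is facing.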

Topics

Object Insertion
3D Pose Control
Diffusion Models
Generative AI
Image Synthesis
Data Augmentation

Code references

black-forest-labs/flux

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.