Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
Summary
DIRECT (Decomposed Injection for Reference Composition and Target-integration) is a framework for 3D-aware object insertion. It targets a gap in current diffusion-based insertion methods, which lack explicit 3D pose control. Published on 2026-06-04, DIRECT combines interactive pose manipulation with high-fidelity 2D image synthesis, so users can set an object's 3D pose during insertion. The method decomposes the insertion conditions into three components: appearance guidance from the reference object, geometry guidance from a user-adjusted 3D proxy, and context guidance from the target background. Each component is injected through its own pathway to prevent feature entanglement, with the goal of preserving the reference appearance, adhering to the specified pose, and adapting the object to the scene. DIRECT also includes an automated data construction pipeline that increases the diversity and quality of the training data. According to the authors' experiments, it outperforms prior approaches in both geometric controllability and visual quality.
Key takeaway
For computer vision engineers or 3D artists who need precise control over object placement in image synthesis, DIRECT adds explicit 3D pose control to diffusion-based insertion. If your current insertion method cannot manipulate 3D pose directly, this framework is worth exploring. You adjust a 3D proxy interactively, and the object is composited in that pose with its reference appearance preserved, which 2D inpainting alone does not offer. This could streamline workflows for virtual try-on, scene generation, or product visualization.
Key insights
DIRECT enables 3D-aware object insertion by decomposing guidance into appearance, geometry, and context, injected separately for precise control.
Principles
- Decomposing conditions avoids feature entanglement.
- Separate injection pathways preserve distinct attributes.
- User-adjusted 3D proxies enable explicit pose control.
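To make the idea of a user-adjusted 3D proxy concrete, here is a minimal sketch of how a posed proxy could be turned into a 2D geometry cue. It assumes a cuboid proxy, a yaw-only rotation, and a pinhole camera. The function names, focal length, and image center are hypothetical and are not taken from DIRECT.

```python
import math


def rotate_yaw(point, yaw_deg):
    """Rotate a 3D point about the vertical (y) axis by yaw_deg degrees."""
    x, y, z = point
    t = math.radians(yaw_deg)
    c, s = math.cos(t), math.sin(t)
    return (c * x + s * z, y, -s * x + c * z)


def project(point, focal=500.0, center=(256.0, 256.0)):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates.

    The focal length and image center are illustrative defaults.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (center[0] + focal * x / z, center[1] + focal * y / z)


def proxy_outline(size, yaw_deg, position):
    """Return the 8 projected corners of a cuboid proxy.

    size:     (width, height, depth) of the proxy
    yaw_deg:  user-chosen rotation about the vertical axis
    position: (x, y, z) of the proxy center in camera space
    """
    hx, hy, hz = (s / 2.0 for s in size)
    corners = [
        (sx * hx, sy * hy, sz * hz)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]
    outline = []
    for corner in corners:
        rx, ry, rz = rotate_yaw(corner, yaw_deg)
        outline.append(
            project((rx + position[0], ry + position[1], rz + position[2]))
        )
    return outline


if __name__ == "__main__":
    # Same proxy, two user-chosen poses: the projected outline changes.
    print(proxy_outline((1.0, 1.0, 1.0), 0.0, (0.0, 0.0, 5.0))[0])
    print(proxy_outline((1.0, 1.0, 1.0), 30.0, (0.0, 0.0, 5.0))[0])
```

Changing `yaw_deg` changes the projected outline while the proxy's size and position stay fixed, which is the kind of explicit pose signal a geometry pathway could consume.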
Method
DIRECT integrates interactive pose manipulation with 2D image synthesis. It decomposes insertion conditions into appearance, geometry, and context guidance, and injects each through a separate pathway. An automated data construction pipeline improves the diversity and quality of the training data.
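The decomposition described above can be illustrated with a toy sketch. It is not DIRECT's architecture: the summary does not specify how the pathways are built, so the additive injection, the `InsertionConditions` container, and the `inject_separately` function are all assumptions made for illustration. The point is only that each condition passes through its own function and is never concatenated with the others beforehand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class InsertionConditions:
    """Hypothetical container for the three decomposed conditions."""
    appearance: Vector  # features from the reference object
    geometry: Vector    # features from the user-posed 3D proxy
    context: Vector     # features from the target background


def inject_separately(
    hidden: Vector,
    cond: InsertionConditions,
    pathways: Dict[str, Callable[[Vector], Vector]],
) -> Vector:
    """Add each condition to the hidden state through its own pathway.

    Additive injection is an assumption of this sketch. Each pathway only
    ever sees its own condition, so the three signals are not mixed
    before they reach the hidden state.
    """
    out = list(hidden)
    for name in ("appearance", "geometry", "context"):
        delta = pathways[name](getattr(cond, name))
        if len(delta) != len(out):
            raise ValueError(f"pathway '{name}' returned a wrong-sized vector")
        out = [h + d for h, d in zip(out, delta)]
    return out


def scale(factor: float) -> Callable[[Vector], Vector]:
    """Stand-in for a learned pathway: multiply every feature by a constant."""
    return lambda v: [factor * x for x in v]


if __name__ == "__main__":
    cond = InsertionConditions(
        appearance=[1.0, 2.0], geometry=[3.0, 4.0], context=[5.0, 6.0]
    )
    pathways = {
        "appearance": scale(1.0),
        "geometry": scale(10.0),
        "context": scale(100.0),
    }
    print(inject_separately([0.0, 0.0], cond, pathways))
```

Because the pathways are independent, editing only `cond.geometry` changes the output solely through the geometry pathway, which mirrors the stated motivation of avoiding feature entanglement.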
In practice
- Insert objects with explicit 3D pose control.
- Preserve object appearance during scene integration.
- Adapt objects seamlessly to target backgrounds.
Topics
- 3D Object Insertion
- Diffusion Models
- Pose Control
- Image Synthesis
- Computer Vision
- Geometric Controllability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.