HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
Summary
HorizonForge is a unified framework for generating photorealistic and controllable driving scenes. It targets a central difficulty in autonomous driving simulation: achieving realism and precise control at the same time. The framework reconstructs scenes as editable 3D Gaussian Splats and meshes, which enables fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency in a single feed-forward pass, so no per-trajectory optimization is needed. The authors also introduce HorizonSuite, a benchmark for evaluating ego- and agent-level editing tasks, including trajectory modifications and object manipulation. In the reported experiments, HorizonForge achieves an 83.4% user-preference gain and a 25.19% FID improvement over the second-best method. The experiments also indicate that the Gaussian-Mesh representation delivers superior fidelity and that temporal priors from video diffusion are necessary for coherent synthesis.
Key takeaway
For AI Scientists and Research Scientists developing autonomous driving simulators, HorizonForge offers a framework for generating realistic and controllable driving scenarios. Its combination of 3D Gaussian Splats, meshes, and video diffusion improves visual fidelity and temporal consistency while allowing precise manipulation of ego and agent trajectories. Consider adopting similar 3D representations and video diffusion techniques to improve the realism and controllability of your simulation environments, especially when evaluating long-tail and safety-critical events.
Key insights
HorizonForge enables photorealistic, controllable driving scene generation using 3D Gaussian Splats and meshes with video diffusion.
Principles
- 3D Gaussian Splats encode richer appearance cues for accurate edits.
- Temporal priors from video diffusion are essential for coherent synthesis.
Method
HorizonForge reconstructs scenes into editable 3D Gaussian Splats and meshes, then renders edits via a noise-aware video diffusion model to ensure spatio-temporal consistency and support language-guided object insertion.
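The 3D manipulation step can be pictured with a small sketch. It is not HorizonForge's implementation: it assumes, purely for illustration, that a reconstructed vehicle is stored as Gaussian centers in an object-local frame and that its trajectory is a list of planar poses (x, y, yaw). The names `place_object` and `shift_laterally` are hypothetical. Only the centers are transformed here; a full splat would also need its per-Gaussian orientations rotated, and the diffusion-based rendering stage is not shown.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) Gaussian center, object-local or world
Pose = Tuple[float, float, float]   # (x, y, yaw) of the vehicle in the world frame


def place_object(local_means: List[Point], pose: Pose) -> List[Point]:
    """Map object-local Gaussian centers into the world frame for one pose."""
    tx, ty, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate about the vertical axis by yaw, then translate by (tx, ty).
    return [(c * x - s * y + tx, s * x + c * y + ty, z) for x, y, z in local_means]


def shift_laterally(trajectory: List[Pose], offset: float) -> List[Pose]:
    """Edit a trajectory by moving every pose `offset` metres to the vehicle's left."""
    # For heading (cos yaw, sin yaw), the left-pointing unit vector is (-sin yaw, cos yaw).
    return [
        (x - offset * math.sin(yaw), y + offset * math.cos(yaw), yaw)
        for x, y, yaw in trajectory
    ]


# Hypothetical toy data: two Gaussian centers and a two-frame straight trajectory.
local_means = [(1.0, 0.0, 0.5), (-1.0, 0.0, 0.5)]
original = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

# Simulate a lane change of 3.5 m to the left, then re-place the object per frame.
edited = shift_laterally(original, 3.5)
frames = [place_object(local_means, pose) for pose in edited]
```

Keeping the object in its own local frame means a trajectory edit only changes the pose list, never the reconstructed appearance. That separation is the kind of property that lets one reconstruction serve many edited trajectories.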
In practice
- Use Gaussian Splats for high-fidelity 3D scene representation.
- Employ video diffusion models for temporal consistency in video generation.
Topics
- Driving Scene Generation
- Gaussian Splatting
- Video Diffusion Models
- Autonomous Driving Simulation
- 3D Object Editing
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.