Modality Forcing for Scalable Spatial Generation
Summary
Modality Forcing is a post-training recipe that enables both conditional and joint image-depth generation with a single Diffusion Transformer (DiT) trained on sparse depth data. It builds on the rich spatial priors already present in Text-to-Image (T2I) models, which makes it simpler than prior approaches that require dense depth data and complex training recipes. By assigning a separate noise level to each modality and using per-modality decoders, Modality Forcing achieves strong, generalizable depth prediction. The authors demonstrate scalability by training T2I models from 370M to 3.3B parameters and observe that larger models trained on more image data produce more accurate depth. Their strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel (absolute relative depth error) by 57% relative to existing joint image-depth generative models, which suggests that T2I pre-training is a scalable objective for spatial perception.
Key takeaway
For computer vision engineers building scalable spatial perception models, Modality Forcing offers a simpler route to joint image-depth generation. You can post-train a single DiT on sparse depth data instead of relying on complex recipes or dense depth datasets. Consider starting from a large pre-trained T2I model: in this work, scaling the T2I model directly improves depth accuracy, and the strongest model reduces AbsRel by 57% relative to existing joint image-depth generative models.
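To make the headline metric concrete, here is a minimal sketch of AbsRel using its standard definition (mean of |prediction - ground truth| / ground truth over valid pixels), plus the arithmetic behind a relative reduction. The function names, the convention that a ground-truth value of 0 marks a missing pixel, and all numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def abs_rel(pred, gt, valid_mask=None):
    """Absolute relative error, averaged over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid_mask is None:
        # Assumption: missing ground truth is encoded as depth 0.
        valid_mask = gt > 0
    valid = np.asarray(valid_mask, dtype=bool)
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

# Toy depths in metres; the third pixel has no ground truth and is ignored.
gt = np.array([2.0, 4.0, 0.0])
pred = np.array([2.2, 3.0, 1.0])
score = abs_rel(pred, gt)

# Hypothetical AbsRel values chosen only to illustrate a 57% reduction.
reduction = relative_reduction(baseline=0.100, new=0.043)
print(f"AbsRel = {score:.3f}, relative reduction = {reduction:.0%}")
```

Lower AbsRel is better, so a 57% reduction means the new error is 43% of the baseline error.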
Key insights
Modality Forcing enables scalable joint image-depth generation from sparse depth data and pre-trained T2I models, and it outperforms prior joint image-depth generative models on depth accuracy.
Principles
- T2I models contain rich spatial priors.
- Larger T2I models yield more accurate depth.
- Image generation is a scalable pre-training objective.
Method
Modality Forcing is a post-training recipe for joint image-depth generation with a single DiT. It assigns a separate noise level to each modality and uses per-modality decoders to handle sparse depth data.
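The idea of separate noise levels per modality can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear interpolation between data and Gaussian noise, the latent shapes, and the function name `noise_modalities` are all assumptions. Reading a clean modality (noise level 0) as the condition is likewise one plausible interpretation of how a single model could cover both joint and conditional generation.

```python
import numpy as np

def noise_modalities(image_latent, depth_latent, t_image, t_depth, rng):
    """Noise each modality with its own level in [0, 1].

    0 keeps the modality clean; 1 replaces it with pure Gaussian noise.
    The linear interpolation schedule is an assumption for illustration.
    """
    eps_image = rng.standard_normal(image_latent.shape)
    eps_depth = rng.standard_normal(depth_latent.shape)
    noisy_image = (1.0 - t_image) * image_latent + t_image * eps_image
    noisy_depth = (1.0 - t_depth) * depth_latent + t_depth * eps_depth
    return noisy_image, noisy_depth

rng = np.random.default_rng(0)
image_latent = rng.standard_normal((4, 8, 8))  # hypothetical latent shapes
depth_latent = rng.standard_normal((1, 8, 8))

# Joint setting: each modality draws an independent noise level.
t_image, t_depth = rng.uniform(size=2)
joint_img, joint_dep = noise_modalities(
    image_latent, depth_latent, t_image, t_depth, rng
)

# Image-conditioned depth: the image stays clean, the depth starts as noise.
cond_img, cond_dep = noise_modalities(
    image_latent, depth_latent, 0.0, 1.0, rng
)
```

In this sketch, swapping the two fixed levels (image at 1, depth at 0) would give the depth-conditioned image case without changing the model interface.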
In practice
- Generate image and depth jointly or conditionally.
- Train on sparse, real-world depth datasets.
- Improve depth accuracy with larger T2I models.
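Training on sparse depth means many pixels have no ground truth. One common way to handle this, shown here as an assumption rather than the paper's confirmed loss, is to compute the depth loss only over pixels with a valid measurement. The function name, the squared-error form, and the toy values are illustrative.

```python
import numpy as np

def masked_depth_loss(pred_depth, sparse_depth, valid_mask):
    """Mean squared error over pixels that have a depth measurement."""
    valid = np.asarray(valid_mask, dtype=bool)
    if not valid.any():
        # No supervision available for this sample.
        return 0.0
    diff = np.asarray(pred_depth, dtype=float)[valid] \
        - np.asarray(sparse_depth, dtype=float)[valid]
    return float(np.mean(diff ** 2))

# Toy 2x2 depth map: only two pixels carry a measurement.
pred_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
sparse_depth = np.array([[1.5, 0.0], [0.0, 4.0]])
valid_mask = np.array([[1, 0], [0, 1]])

loss = masked_depth_loss(pred_depth, sparse_depth, valid_mask)
print(loss)
```

Pixels outside the mask contribute nothing to the loss, so predictions there are shaped only by the model's priors and by the supervised pixels.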
Topics
- Modality Forcing
- Image-Depth Generation
- Text-to-Image Models
- Spatial Perception
- Diffusion Transformers
- Monocular Depth Estimation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.