EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion
Summary
EfficientPENet is a two-branch depth completion network designed for real-time operation on embedded hardware. It generates dense depth maps from sparse LiDAR measurements and RGB images. It replaces the traditional ResNet encoder with a modernized ConvNeXt backbone, incorporates sparsity-invariant convolutions for the depth stream, and refines predictions using a Convolutional Spatial Propagation Network (CSPN). The RGB branch uses ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7×7 depthwise convolutions, and stochastic depth regularization. Features are merged via late fusion and decoded with multi-scale deep supervision. EfficientPENet achieves an RMSE of 631.94 mm on the KITTI depth completion benchmark with 36.24M parameters and a latency of 20.51 ms (48.76 FPS). This represents a 3.7x reduction in parameters and a 23x speedup compared to BP-Net while maintaining competitive accuracy, making it suitable for resource-constrained edge platforms like the NVIDIA Jetson.
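To make the reported numbers concrete, here is a minimal sketch of how an RMSE in millimetres and a frames-per-second figure are typically computed. The function names are hypothetical, and the convention that a ground-truth value of 0 marks a pixel without ground truth is an assumption of this sketch, not something the summary states.

```python
import numpy as np

def rmse_mm(pred_mm, gt_mm):
    """RMSE in millimetres over pixels that have ground truth.

    Assumption of this sketch: a ground-truth value of 0 means "no measurement".
    """
    pred_mm = np.asarray(pred_mm, dtype=np.float64)
    gt_mm = np.asarray(gt_mm, dtype=np.float64)
    valid = gt_mm > 0
    if not valid.any():
        raise ValueError("ground truth has no valid pixels")
    err = pred_mm[valid] - gt_mm[valid]
    return float(np.sqrt(np.mean(err ** 2)))

def latency_ms_to_fps(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

# Toy 2x2 example: one pixel has no ground truth and is ignored.
pred = np.array([[1000.0, 2500.0], [4000.0, 7000.0]])
gt = np.array([[1100.0, 0.0], [4000.0, 6800.0]])
print(rmse_mm(pred, gt))                    # about 129.1 mm
print(round(latency_ms_to_fps(20.51), 2))   # 48.76, matching the reported FPS
```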
Key takeaway
For AI scientists building real-time 3D perception systems on embedded platforms, EfficientPENet shows that modern ConvNet architectures can deliver both high accuracy and high throughput. Consider adopting ConvNeXt backbones and sparsity-invariant convolutions to meet strict latency and parameter budgets, especially in applications such as robotic inspection where near-field accuracy is critical. The approach enables robust depth completion without relying on computationally expensive models.
Key insights
EfficientPENet achieves real-time depth completion on edge devices by combining a lightweight ConvNeXt backbone with sparsity-invariant convolutions and CSPN refinement.
Principles
- Modernized ConvNets can match Transformer accuracy with better efficiency.
- Sparsity-invariant convolutions prevent feature corruption from missing data.
- Position-aware test-time augmentation (TTA) improves accuracy for coordinate-encoded networks.
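The second principle can be illustrated with a minimal single-channel sketch of a sparsity-invariant convolution in NumPy. It follows the commonly described formulation: convolve only the observed values, normalize by the number of observed pixels in each window, and propagate the validity mask with a max over the window. The function name and the explicit loops are choices made here for readability. This is not the paper's implementation.

```python
import numpy as np

def sparse_conv2d(x, mask, weight, bias=0.0, eps=1e-8):
    """Single-channel sparsity-invariant convolution (illustrative sketch).

    x:      (H, W) sparse depth; values where mask == 0 are ignored
    mask:   (H, W) 1 where a measurement exists, 0 elsewhere
    weight: (k, k) kernel with odd k
    Returns the filtered map and the propagated validity mask.
    """
    x = np.asarray(x, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    k = weight.shape[0]
    pad = k // 2
    xp = np.pad(x * mask, pad)   # zero out unobserved pixels, then zero-pad
    mp = np.pad(mask, pad)
    out = np.zeros_like(x)
    new_mask = np.zeros_like(mask)
    height, width = x.shape
    for i in range(height):
        for j in range(width):
            xw = xp[i:i + k, j:j + k]
            mw = mp[i:i + k, j:j + k]
            # Normalize by the number of observed pixels, not the window size.
            out[i, j] = (xw * weight).sum() / (mw.sum() + eps) + bias
            # A location is valid if any pixel in its window was observed.
            new_mask[i, j] = mw.max()
    return out, new_mask

# A flat surface at depth 5.0 observed at only three pixels.
depth = np.full((4, 4), 5.0)
mask = np.zeros((4, 4))
mask[0, 0] = mask[2, 3] = mask[3, 1] = 1.0
out, new_mask = sparse_conv2d(depth, mask, np.ones((3, 3)))
print(out)
print(new_mask)
```

With an all-ones kernel, every location that sees at least one measurement recovers the true depth regardless of how many neighbours were observed. An ordinary convolution would instead scale its output with the local measurement density.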
Method
EfficientPENet uses a two-branch encoder-decoder: ConvNeXt for RGB, sparsity-invariant convolutions for depth. It fuses features late, applies multi-scale deep supervision, and refines output with CSPN.
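As a rough illustration of the refinement stage, the sketch below implements one simplified CSPN-style propagation scheme in NumPy: each pixel is updated as a weighted mix of its eight neighbours and its own initial prediction, with the affinities normalized so the weights sum to one. The summary does not say which CSPN variant, kernel size, or iteration count EfficientPENet uses, so the 3×3 neighbourhood, the edge padding, and the names `cspn_refine` and `shift` are assumptions of this sketch.

```python
import numpy as np

# Offsets (dy, dx) of the eight neighbours in a 3x3 window.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def shift(a, dy, dx):
    """Return b with b[i, j] = a[i + dy, j + dx], replicating border values."""
    height, width = a.shape
    p = np.pad(a, 1, mode="edge")
    return p[1 + dy:1 + dy + height, 1 + dx:1 + dx + width]

def cspn_refine(depth0, affinity, steps=3, eps=1e-8):
    """Simplified CSPN-style refinement (illustrative, not the paper's code).

    depth0:   (H, W) initial dense prediction
    affinity: (8, H, W) raw affinities, one map per neighbour offset
    """
    # Normalize so the absolute neighbour weights sum to at most 1 per pixel.
    kappa = affinity / (np.abs(affinity).sum(axis=0) + eps)
    # The centre weight keeps the total weight at exactly 1.
    kappa_center = 1.0 - kappa.sum(axis=0)
    d = depth0.copy()
    for _ in range(steps):
        neighbours = sum(kappa[k] * shift(d, dy, dx)
                         for k, (dy, dx) in enumerate(OFFSETS))
        d = kappa_center * depth0 + neighbours
    return d

rng = np.random.default_rng(0)
initial = rng.uniform(1.0, 10.0, size=(5, 6))
refined = cspn_refine(initial, rng.normal(size=(8, 5, 6)))
print(refined.shape)
```

Because the weights sum to one at every pixel, a constant depth map is left unchanged, and zero affinities return the initial prediction untouched.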
In practice
- Use ConvNeXt for efficient RGB feature extraction on edge devices.
- Implement sparsity-invariant convolutions for sparse sensor data.
- Apply position-aware TTA for coordinate-encoded models to reduce RMSE.
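The summary does not spell out how position-aware TTA works, so the following is one plausible reading rather than the paper's procedure: when averaging a prediction with its horizontally flipped counterpart, the coordinate channel is flipped together with the image, so every pixel keeps the coordinate it had in the original frame. `toy_model` is a hypothetical per-pixel stand-in for a coordinate-encoded network, chosen so the effect is easy to see.

```python
import numpy as np

def x_coords(height, width):
    """Normalized column coordinate in [-1, 1] for every pixel."""
    return np.tile(np.linspace(-1.0, 1.0, width), (height, 1))

def toy_model(rgb, sparse_depth, xcoord):
    """Hypothetical stand-in network whose output depends on pixel position."""
    return rgb + 0.5 * sparse_depth + 2.0 * xcoord

def flip_tta(model, rgb, sparse_depth, xcoord, position_aware=True):
    """Average the prediction with a horizontally flipped pass."""
    pred = model(rgb, sparse_depth, xcoord)
    # Position-aware: flip the coordinate channel along with the inputs.
    # Naive: reuse the unflipped grid, so content and coordinates disagree.
    coord_for_flip = xcoord[:, ::-1] if position_aware else xcoord
    pred_flip = model(rgb[:, ::-1], sparse_depth[:, ::-1], coord_for_flip)
    return 0.5 * (pred + pred_flip[:, ::-1])

rng = np.random.default_rng(0)
rgb = rng.uniform(size=(3, 8))
sparse = rng.uniform(size=(3, 8))
xc = x_coords(3, 8)

reference = toy_model(rgb, sparse, xc)
aware = flip_tta(toy_model, rgb, sparse, xc, position_aware=True)
naive = flip_tta(toy_model, rgb, sparse, xc, position_aware=False)
print(np.abs(aware - reference).max())
print(np.abs(naive - reference).max())
```

For this per-pixel toy model the position-aware average reproduces the single-pass prediction, while the naive variant introduces a position-dependent bias. A real network would not be exactly flip-consistent, so any RMSE gain has to be measured on validation data.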
Topics
- Depth Completion
- ConvNeXt Architecture
- Sparse LiDAR Processing
- Multi-Modal Sensor Fusion
- Real-Time Embedded Systems
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.