EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

EfficientPENet is a two-branch depth completion network built for real-time operation on embedded hardware. It produces dense depth maps from sparse LiDAR measurements and RGB images. The network replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, uses sparsity-invariant convolutions in the depth stream, and refines predictions with a Convolutional Spatial Propagation Network (CSPN). The RGB branch uses ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from the two branches are merged by late fusion and decoded with multi-scale deep supervision. On the KITTI depth completion benchmark, EfficientPENet reaches an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms (48.76 FPS). Compared with BP-Net, that is 3.7x fewer parameters and a 23x speedup at competitive accuracy, which makes the model suitable for resource-constrained edge platforms such as the NVIDIA Jetson.
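The latency and throughput figures quoted above are two views of the same measurement. A minimal sketch of the conversion, assuming one frame is processed at a time (batch size 1); the helper name `latency_ms_to_fps` is illustrative and not from the paper:

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


reported_latency_ms = 20.51  # latency quoted in the summary
fps = latency_ms_to_fps(reported_latency_ms)
print(f"{fps:.2f} FPS")  # prints "48.76 FPS"
```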

Key takeaway

For AI scientists building real-time 3D perception systems on embedded platforms, EfficientPENet shows that a modernized ConvNet architecture can deliver competitive accuracy at real-time throughput. Consider ConvNeXt backbones and sparsity-invariant convolutions when latency and parameter budgets are strict, especially in applications such as robotic inspection where near-field accuracy is critical. The approach supports robust depth completion without heavy, computationally expensive models.

Key insights

EfficientPENet achieves real-time depth completion on edge devices by combining a lightweight ConvNeXt backbone with sparsity-invariant convolutions and CSPN refinement.

Principles

Modernized ConvNets can match Transformer accuracy with better efficiency.
Sparsity-invariant convolutions prevent feature corruption from missing data.
Position-aware test-time augmentation (TTA) improves accuracy for coordinate-encoded networks.
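To make the sparsity-invariant principle concrete, here is a minimal single-channel sketch in plain Python. It follows the commonly cited formulation (weight only the valid pixels, normalize by the number of valid pixels in the window, and propagate the validity mask with a max over the window). The function name `sparse_conv2d`, the zero padding, and the stride of 1 are assumptions for illustration; the paper's actual layers may differ.

```python
from typing import List, Tuple

Grid = List[List[float]]


def sparse_conv2d(depth: Grid, mask: Grid, weights: Grid,
                  bias: float = 0.0, eps: float = 1e-8) -> Tuple[Grid, Grid]:
    """Single-channel sparsity-invariant convolution (zero padding, stride 1).

    depth:   sparse depth values; entries where mask == 0 are ignored
    mask:    1.0 where a LiDAR measurement exists, 0.0 elsewhere
    weights: k x k kernel with odd k
    Returns the filtered map and the propagated validity mask.
    """
    h, w = len(depth), len(depth[0])
    r = len(weights) // 2
    out = [[0.0] * w for _ in range(h)]
    new_mask = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weighted_sum = 0.0
            valid_count = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        m = mask[yy][xx]
                        weighted_sum += weights[dy + r][dx + r] * depth[yy][xx] * m
                        valid_count += m
            # Normalize by the number of valid inputs, not by the kernel size.
            out[y][x] = weighted_sum / (valid_count + eps) + bias
            # Mask update: valid if any input in the window was valid.
            new_mask[y][x] = 1.0 if valid_count > 0 else 0.0
    return out, new_mask


# One valid measurement (10.0) in the centre of an otherwise empty 5x5 grid.
depth = [[0.0] * 5 for _ in range(5)]
mask = [[0.0] * 5 for _ in range(5)]
depth[2][2], mask[2][2] = 10.0, 1.0
ones = [[1.0] * 3 for _ in range(3)]
out, new_mask = sparse_conv2d(depth, mask, ones)
```

In this toy case every output pixel whose window contains the measurement comes out at roughly 10.0 instead of being diluted by the surrounding zeros, and the validity mask grows from one pixel to the 3x3 block around it.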

Method

EfficientPENet uses a two-branch encoder-decoder: ConvNeXt for RGB, sparsity-invariant convolutions for depth. It fuses features late, applies multi-scale deep supervision, and refines output with CSPN.
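The CSPN refinement stage can be illustrated with one propagation step over a 3x3 neighbourhood. This is a minimal sketch of the original CSPN update rule, assuming raw affinities are normalized by the sum of their absolute values and the centre weight is one minus the sum of the neighbour weights. The function name `cspn_step`, the border handling (out-of-bounds neighbours are skipped), and the optional pinning of measured pixels are illustrative choices; the variant used in EfficientPENet may differ.

```python
from typing import List, Optional

Grid = List[List[float]]
# Offsets of the 8 neighbours in a 3x3 window, in a fixed order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def cspn_step(current: Grid, initial: Grid, affinity: List[List[List[float]]],
              sparse: Optional[Grid] = None, mask: Optional[Grid] = None,
              eps: float = 1e-8) -> Grid:
    """One spatial propagation step.

    current:  depth estimate from the previous step
    initial:  coarse depth from the decoder (step 0)
    affinity: 8 raw affinities per pixel, ordered as OFFSETS
    sparse, mask: optional LiDAR depth and validity; valid pixels are pinned
    """
    h, w = len(current), len(current[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            pairs = []  # (raw affinity, neighbour depth) for in-bounds neighbours
            for n, (dy, dx) in enumerate(OFFSETS):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    pairs.append((affinity[y][x][n], current[yy][xx]))
            norm = sum(abs(a) for a, _ in pairs) + eps
            kappas = [a / norm for a, _ in pairs]
            # Centre weight keeps the weights summing to one.
            value = (1.0 - sum(kappas)) * initial[y][x]
            value += sum(k * d for k, (_, d) in zip(kappas, pairs))
            if sparse is not None and mask is not None and mask[y][x] > 0:
                value = sparse[y][x]  # keep measured depth unchanged
            out[y][x] = value
    return out


# Toy run: a coarse map with one outlier, uniform affinities, three steps.
coarse = [[5.0, 5.0, 5.0], [5.0, 9.0, 5.0], [5.0, 5.0, 5.0]]
uniform = [[[1.0] * 8 for _ in range(3)] for _ in range(3)]
refined = coarse
for _ in range(3):
    refined = cspn_step(refined, coarse, uniform)
```

In a trained network the affinities are predicted per pixel from the fused features, so propagation follows image structure rather than smoothing uniformly as in this toy run.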

In practice

Use ConvNeXt for efficient RGB feature extraction on edge devices.
Implement sparsity-invariant convolutions for sparse sensor data.
Apply position-aware TTA for coordinate-encoded models to reduce RMSE.
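The summary does not spell out how position-aware test-time augmentation (TTA) is implemented. One plausible reading, sketched here purely as an assumption, is horizontal-flip TTA in which the coordinate map is flipped together with the image, so each pixel still carries its true sensor position when the network sees the mirrored input. The `model` callable, its three-argument signature, and the single-channel image are hypothetical stand-ins, not the paper's interface.

```python
from typing import Callable, List

Grid = List[List[float]]
Model = Callable[[Grid, Grid, Grid], Grid]  # hypothetical: (image, sparse depth, u coords) -> depth


def hflip(grid: Grid) -> Grid:
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def u_coordinates(height: int, width: int) -> Grid:
    """Horizontal pixel coordinate per pixel, normalized to [-1, 1]."""
    if width == 1:
        return [[0.0] for _ in range(height)]
    return [[2.0 * x / (width - 1) - 1.0 for x in range(width)]
            for _ in range(height)]


def flip_tta(model: Model, image: Grid, sparse_depth: Grid) -> Grid:
    """Average the plain prediction with a flipped-and-restored prediction."""
    h, w = len(image), len(image[0])
    u = u_coordinates(h, w)
    plain = model(image, sparse_depth, u)
    # Flip the coordinate map along with the inputs (the "position-aware" part
    # under this sketch's assumption), then undo the flip on the prediction.
    flipped = model(hflip(image), hflip(sparse_depth), hflip(u))
    restored = hflip(flipped)
    return [[0.5 * (a + b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(plain, restored)]


def toy_model(image: Grid, sparse_depth: Grid, u: Grid) -> Grid:
    """Stand-in network: a per-pixel function of depth and position."""
    return [[d + 0.1 * c for d, c in zip(d_row, c_row)]
            for d_row, c_row in zip(sparse_depth, u)]


image = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5]]
lidar = [[0.0, 7.0, 0.0], [3.0, 0.0, 0.0]]
averaged = flip_tta(toy_model, image, lidar)
```

For the per-pixel toy model the flipped pass reproduces the plain prediction exactly, which is the consistency this scheme aims for. A real network has spatial context, so the two passes differ slightly and averaging them is where any RMSE gain would come from.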

Topics

Depth Completion
ConvNeXt Architecture
Sparse LiDAR Processing
Multi-Modal Sensor Fusion
Real-Time Embedded Systems

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Robotics Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.