Optimizing a Neural Reconstruction Pipeline Using NVIDIA Nsight Developer Tools
Summary
NVIDIA optimized its Omniverse NuRec neural reconstruction pipeline, which generates high-fidelity 3D representations from multisensor data for autonomous vehicle and robotics simulation. Reconstructions often took more than an hour, so the team used NVIDIA Nsight Developer Tools, specifically Nsight Systems and Nsight Compute, to reduce computational cost and turnaround time, with real-time reconstruction as the eventual goal. Nsight Systems traces pointed to bottlenecks in functions such as `collect_gaussian_parameters` and `interpolate`, which launched many small CUDA kernels and made inefficient `cudaStreamSynchronize` calls. Fusing the kernels in `interpolate` cut its runtime from 4.184 ms to 83.81 µs, a nearly 50x speedup. The team then used Nsight Compute to optimize the `renderBackward` kernel, splitting it into separate lidar and camera kernels and tuning register allocation and shared memory. GPU occupancy rose from about 15% to 30-50%, and the runtime of the longest lidar kernel fell from 31 ms to 18 ms. Optimization is ongoing, with current work focused on workload imbalance.
Key takeaway
Machine learning engineers optimizing complex GPU-accelerated pipelines such as neural reconstruction benefit from applying NVIDIA Nsight Developer Tools systematically. Use Nsight Systems to find high-level bottlenecks and CPU-GPU synchronization issues, then Nsight Compute for fine-grained kernel optimization. This iterative approach, which includes fusing small kernels and tailoring resource allocation to different data types, can significantly reduce processing time and infrastructure cost. Download Nsight Systems and Nsight Compute to begin optimizing your own workloads.
Key insights
Iterative performance optimization using NVIDIA Nsight tools identifies and resolves GPU bottlenecks for significant speedups.
Principles
- GPU underutilization often stems from CPU-side bottlenecks or many small kernels.
- Kernel behavior can differ significantly based on input data types.
- Static resource allocation can be inefficient for varied workloads.
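To see why many small kernels can leave a GPU underused, consider a back-of-the-envelope model in which every launch pays a fixed host-side cost. The function name `modeled_time_us` and all numbers below are hypothetical and chosen only for illustration; they are not measurements from NuRec.

```python
def modeled_time_us(num_kernels, work_us_total, launch_overhead_us):
    """Total time if launches are serialized and each pays a fixed overhead.

    Simplifying assumption: launch overhead never overlaps with GPU work.
    """
    return num_kernels * launch_overhead_us + work_us_total


# Hypothetical values: 200 tiny kernels doing 100 us of useful work in total,
# with an assumed cost of 5 us per launch.
unfused = modeled_time_us(num_kernels=200, work_us_total=100.0, launch_overhead_us=5.0)
# The same work issued as a single fused kernel pays the overhead once.
fused = modeled_time_us(num_kernels=1, work_us_total=100.0, launch_overhead_us=5.0)

print(f"unfused: {unfused:.0f} us, fused: {fused:.0f} us, ratio: {unfused / fused:.1f}x")
```

In this toy model the useful work is identical in both cases; the difference comes entirely from launch count, which is the effect kernel fusion targets.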
Method
Profile with Nsight Systems to find platform-level bottlenecks, add NVTX ranges for function-level detail, fuse small kernels, and remove unnecessary synchronization points. Then use Nsight Compute for kernel-specific tuning, such as splitting kernels and adjusting resource allocation.
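The NVTX step can be sketched in Python as follows. This is a minimal illustration, not the NuRec code: the `profiled_range` helper, the `training_step` function, and the placeholder bodies are assumptions, and only the two range names come from the summary above. NVTX ranges are emitted through `torch.cuda.nvtx` when a CUDA-enabled PyTorch build is available; otherwise the helper only records host wall time.

```python
import contextlib
import time

try:
    import torch
    # Only emit NVTX ranges when a CUDA-enabled PyTorch build is present.
    _NVTX = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _NVTX = None

# Host-side wall time per range, in seconds (hypothetical bookkeeping).
host_timings_s = {}


@contextlib.contextmanager
def profiled_range(name):
    """Mark a named range for Nsight Systems and record host wall time."""
    if _NVTX is not None:
        _NVTX.range_push(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        host_timings_s[name] = time.perf_counter() - start
        if _NVTX is not None:
            _NVTX.range_pop()


def training_step():
    # Placeholder bodies: the real functions are not shown in the source.
    with profiled_range("collect_gaussian_parameters"):
        params = [i * 0.5 for i in range(1000)]
    with profiled_range("interpolate"):
        result = [(a + b) / 2 for a, b in zip(params, params[1:])]
    return result


out = training_step()
print(sorted(host_timings_s))
```

Running such a script under `nsys profile --trace=cuda,nvtx` should show the named ranges on the timeline next to the CUDA API calls and kernels they enclose. Because GPU work is launched asynchronously, the host wall times recorded here do not measure GPU execution time; the Nsight Systems timeline is the place to read that.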
In practice
- Profile PyTorch training loops using Nsight Systems and NVTX.
- Analyze `cudaStreamSynchronize` calls for CPU-GPU stalls.
- Split CUDA kernels by data type; tune `__launch_bounds__` and `cudaFuncSetCacheConfig`.
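The link between register tuning and occupancy can be illustrated with a simplified estimate. The sketch below assumes 65,536 registers and 2,048 resident threads per streaming multiprocessor, which holds for several NVIDIA data center GPUs but should be checked against the target device. The function name and the register counts are hypothetical, not values from the NuRec kernels.

```python
def register_limited_occupancy(regs_per_thread, threads_per_block,
                               regs_per_sm=65536, max_threads_per_sm=2048):
    """Upper bound on occupancy when registers are the only limiting resource.

    Ignores register allocation granularity, shared memory, and the
    per-SM block limit, so real occupancy can be lower.
    """
    blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_threads = max_threads_per_sm // threads_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_threads)
    return resident_blocks * threads_per_block / max_threads_per_sm


# Hypothetical register counts for a kernel launched with 256 threads per block.
for regs in (255, 128, 64):
    occ = register_limited_occupancy(regs, threads_per_block=256)
    print(f"{regs} registers/thread -> {occ:.1%}")
```

Under these assumptions, halving the registers per thread doubles the number of blocks that fit on a multiprocessor. The `__launch_bounds__` qualifier gives the compiler the information it needs to cap register usage for a target block size, and Nsight Compute reports the achieved occupancy so the estimate can be checked against a real run.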
Topics
- NVIDIA Omniverse NuRec
- Neural Reconstruction
- GPU Performance Optimization
- Nsight Developer Tools
- CUDA Kernel Optimization
- Autonomous Vehicle Simulation
Best for: Machine Learning Engineer, AI Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.