Optimizing a Neural Reconstruction Pipeline Using NVIDIA Nsight Developer Tools

2026-06-30 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

NVIDIA optimized its Omniverse NuRec neural reconstruction pipeline, which generates high-fidelity 3D representations from multisensor data for autonomous vehicle and robotics simulation. Reconstruction is computationally expensive: initial runs often exceeded an hour, and the team's goal is to move turnaround time toward real time. The work relied on NVIDIA Nsight Developer Tools, specifically Nsight Systems and Nsight Compute. Nsight Systems identified bottlenecks in functions such as `collect_gaussian_parameters` and `interpolate`, revealing numerous small CUDA kernels and inefficient `cudaStreamSynchronize` calls. Fusing the kernels in `interpolate` achieved a nearly 50x speedup, reducing its runtime from 4.184 ms to 83.81 µs. The team then used Nsight Compute to optimize the `renderBackward` kernel, splitting it into separate lidar and camera kernels and tuning register allocation and shared memory. This raised GPU occupancy from ~15% to 30-50% and cut the longest lidar kernel's runtime from 31 ms to 18 ms. Optimization continues iteratively, with ongoing work on workload imbalance.
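The speedup figures quoted above follow directly from the before and after timings. A minimal check of that arithmetic, using only the numbers in this summary (the helper names are illustrative, not from the original article):

```python
# Timings quoted in the summary, converted to seconds.
interpolate_before_s = 4.184e-3  # 4.184 milliseconds
interpolate_after_s = 83.81e-6   # 83.81 microseconds
lidar_before_s = 31e-3           # 31 milliseconds
lidar_after_s = 18e-3            # 18 milliseconds


def speedup(before_s, after_s):
    """Return how many times faster the 'after' timing is."""
    return before_s / after_s


def reduction_pct(before_s, after_s):
    """Return the percentage of runtime removed."""
    return 100.0 * (before_s - after_s) / before_s


interpolate_speedup = speedup(interpolate_before_s, interpolate_after_s)
lidar_speedup = speedup(lidar_before_s, lidar_after_s)
lidar_reduction = reduction_pct(lidar_before_s, lidar_after_s)

print(f"interpolate: {interpolate_speedup:.1f}x faster")
print(f"longest lidar kernel: {lidar_speedup:.2f}x faster, "
      f"{lidar_reduction:.0f}% less time")
```

The `interpolate` ratio comes out just under 50, which matches the "nearly 50x" wording. The lidar kernel improvement is about 1.7x, or roughly 42% less runtime.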

Key takeaway

For machine learning engineers optimizing complex GPU-accelerated pipelines such as neural reconstruction, applying NVIDIA Nsight Developer Tools systematically is crucial. Use Nsight Systems to identify high-level bottlenecks and CPU-GPU synchronization issues, then Nsight Compute for fine-grained kernel optimization. This iterative approach, including fusing small kernels and tailoring resource allocations to different data types, can significantly reduce processing times and infrastructure costs. Download Nsight Systems and Nsight Compute to begin optimizing your own workloads.

Key insights

Iterative performance optimization using NVIDIA Nsight tools identifies and resolves GPU bottlenecks for significant speedups.

Principles

GPU underutilization often stems from CPU-side bottlenecks or many small kernels.
Kernel behavior can differ significantly based on input data types.
Static resource allocation can be inefficient for varied workloads.
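The first principle can be made concrete with a back-of-envelope cost model. The sketch below assumes launches are serialized and that each launch pays a fixed overhead. Every number in it is hypothetical and chosen only for illustration; none are measurements from the original article.

```python
def pipeline_time_us(n_kernels, work_us_per_kernel, launch_overhead_us):
    """Modelled time when each kernel pays a fixed launch overhead
    and kernels run back to back (a deliberate simplification)."""
    return n_kernels * (launch_overhead_us + work_us_per_kernel)


# Hypothetical parameters, for illustration only.
N_SMALL_KERNELS = 200
WORK_US = 0.5             # useful GPU work per small kernel
LAUNCH_OVERHEAD_US = 5.0  # assumed fixed cost per launch

# Many small kernels: the overhead is paid 200 times.
unfused_us = pipeline_time_us(N_SMALL_KERNELS, WORK_US, LAUNCH_OVERHEAD_US)

# One fused kernel doing the same total work: the overhead is paid once.
fused_us = pipeline_time_us(1, N_SMALL_KERNELS * WORK_US, LAUNCH_OVERHEAD_US)

print(f"unfused: {unfused_us:.0f} us, fused: {fused_us:.0f} us, "
      f"ratio: {unfused_us / fused_us:.1f}x")
```

In this toy model the useful work is identical in both cases; the difference comes entirely from how often the fixed per-launch cost is paid, which is why fusing many tiny kernels can pay off when a workload is launch-bound.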

Method

Profile with Nsight Systems to find platform-level bottlenecks, adding NVTX annotations for function-level detail. Fuse small kernels and remove unnecessary synchronization points. Then use Nsight Compute for kernel-specific tuning, such as splitting kernels and optimizing resource allocation.
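The NVTX step might look like the following in a PyTorch training loop. This is a minimal sketch, not the NuRec code: the stage names and the `train_step` function are hypothetical placeholders, and the helper degrades to a no-op when PyTorch with CUDA is unavailable so the script still runs. It uses `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`, which mark named ranges that Nsight Systems displays on its timeline.

```python
import contextlib

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # let the sketch run on machines without PyTorch
    torch = None
    HAVE_CUDA = False

entered_ranges = []  # bookkeeping for this sketch only


@contextlib.contextmanager
def nvtx_range(name):
    """Open a named NVTX range if CUDA PyTorch is present, else do nothing."""
    entered_ranges.append(name)
    if HAVE_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if HAVE_CUDA:
            torch.cuda.nvtx.range_pop()


def train_step(step):
    """Hypothetical training step; each stage body is a placeholder."""
    with nvtx_range(f"step_{step}"):
        with nvtx_range("forward"):
            pass  # model forward pass would go here
        with nvtx_range("backward"):
            pass  # loss.backward() would go here
        with nvtx_range("optimizer"):
            pass  # optimizer.step() would go here


for step in range(2):
    train_step(step)
```

Assuming the script is saved as `train.py` (a hypothetical file name), a run such as `nsys profile --trace=cuda,nvtx -o report python train.py` would record the ranges alongside CUDA activity, making it easier to see which stage launches many small kernels or waits on synchronization.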

In practice

Profile PyTorch training loops using Nsight Systems and NVTX.
Analyze `cudaStreamSynchronize` calls for CPU-GPU stalls.
Split CUDA kernels by data type; tune `__launch_bounds__` and `cudaFuncSetCacheConfig`.
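One reason register tuning affects occupancy is that registers are a fixed per-SM resource, so fewer registers per thread allow more warps to be resident at once. The estimate below is a simplified sketch under stated assumptions: 65,536 32-bit registers and a 64-warp limit per SM (both vary by GPU architecture), and it ignores register allocation granularity, block size, and shared memory limits. Nsight Compute reports the actual limiters for a given kernel.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def occupancy_from_registers(regs_per_thread,
                             regs_per_sm=65536,    # assumed; varies by GPU
                             max_warps_per_sm=64):  # assumed; varies by GPU
    """Upper bound on occupancy imposed by register usage alone."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    warps_by_regs = regs_per_sm // regs_per_warp
    resident_warps = min(warps_by_regs, max_warps_per_sm)
    return resident_warps / max_warps_per_sm


for regs in (255, 128, 64, 32):
    print(f"{regs:>3} registers/thread -> "
          f"at most {occupancy_from_registers(regs):.1%} occupancy")
```

Under these assumptions, halving register use roughly doubles the occupancy ceiling until another limit takes over. The `__launch_bounds__` qualifier is one way to give the compiler a register budget to aim for, at the possible cost of register spilling, which is why such settings are tuned per kernel rather than applied uniformly.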

Topics

NVIDIA Omniverse NuRec
Neural Reconstruction
GPU Performance Optimization
Nsight Developer Tools
CUDA Kernel Optimization
Autonomous Vehicle Simulation

Code references

NVIDIA/NVTX

Best for: Machine Learning Engineer, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.