cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States
Summary
cuNNQS-SCI is a fully GPU-accelerated framework designed to overcome scalability limitations in the Neural Network Quantum State-Selected Configuration Interaction (NNQS-SCI) method, a state-of-the-art technique for solving the Schrödinger equation. The original NNQS-SCI suffered from bottlenecks due to its hybrid CPU-GPU architecture, specifically centralized CPU-based global de-duplication and host-resident coupled-configuration generation. cuNNQS-SCI integrates a distributed, load-balanced global de-duplication algorithm, employs specialized CUDA kernels for exact coupled configuration generation, and incorporates a GPU memory-centric runtime with pooling, streaming mini-batches, and overlapped offloading. This design enables larger configuration spaces and shifts the performance bottleneck from host-side limitations to on-device inference. Evaluated on an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32x end-to-end speedup over the NNQS-SCI baseline while maintaining chemical accuracy and demonstrating over 90% parallel efficiency in strong scaling tests.
Key takeaway
For AI Engineers and Research Scientists working on large-scale quantum chemistry simulations, cuNNQS-SCI demonstrates that fully migrating complex workflows to GPUs, coupled with GPU-centric memory management (pooling, streamed mini-batches, overlapped offloading) and distributed algorithms, can yield substantial performance gains and enable previously intractable problem sizes. Consider adopting similar GPU-centric architectural redesigns for the non-AI components of your AI-for-Science applications to overcome CPU-side bottlenecks and maximize accelerator utilization.
Key insights
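cuNNQS-SCI, a fully GPU-accelerated quantum chemistry framework, improves scalability and performance by eliminating CPU bottlenecks, reaching up to 2.32x end-to-end speedup over the NNQS-SCI baseline and over 90% parallel efficiency in strong scaling tests on a 64-GPU NVIDIA A100 cluster.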
Principles
- Migrate entire workflow to GPU to remove CPU bottlenecks.
- Distribute global de-duplication to minimize communication.
- Employ GPU memory-centric design for large datasets.
Method
cuNNQS-SCI uses a three-stage pipeline: (1) massively parallel generation and global de-duplication, (2) batched inference and hierarchical selection, and (3) energy calculation and network optimization. All three stages run on GPUs, with host memory used for staging.
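To make the generation stage concrete, here is a minimal CPU-side sketch in Python of how coupled configurations can be produced with bitwise operations. It is not the paper's CUDA kernel. It assumes a configuration is stored as an integer bitmask with one bit per spin orbital, and it enumerates all single and double excitations without applying spin or symmetry constraints. The function name `coupled_configurations` and its parameters are hypothetical.

```python
from itertools import combinations


def coupled_configurations(config: int, n_orbitals: int) -> list[int]:
    """Enumerate single and double excitations of a bitmask configuration.

    Assumption: bit i of `config` is 1 when spin orbital i is occupied.
    Spin and point-group constraints are ignored in this sketch.
    """
    occupied = [i for i in range(n_orbitals) if (config >> i) & 1]
    virtual = [i for i in range(n_orbitals) if not (config >> i) & 1]

    coupled = []

    # Single excitations: clear one occupied bit and set one virtual bit.
    for i in occupied:
        for a in virtual:
            coupled.append(config ^ (1 << i) ^ (1 << a))

    # Double excitations: flip two occupied and two virtual bits at once.
    for i, j in combinations(occupied, 2):
        for a, b in combinations(virtual, 2):
            mask = (1 << i) | (1 << j) | (1 << a) | (1 << b)
            coupled.append(config ^ mask)

    return coupled


# Example: 2 electrons in 4 spin orbitals, reference 0b0011.
children = coupled_configurations(0b0011, 4)
```

Every excitation is a single XOR against a small mask, and each (parent, excitation) pair is independent of the others, which is the kind of work that maps naturally onto one GPU thread per pair. Different parents can produce the same child, which is why a global de-duplication step follows generation.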
In practice
- Utilize sort-based regular sampling for distributed de-duplication.
- Design fine-grained CUDA kernels for bitwise operations.
- Implement mini-batch processing with asynchronous offloading.
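The first item above can be illustrated with a small single-process simulation in Python. The summary does not spell out the algorithm's details, so this is a generic sample-sort style sketch under stated assumptions: each inner list stands in for the keys held by one rank (one GPU), splitters are chosen from regularly spaced samples of the locally sorted data, and the all-to-all exchange is simulated by appending to per-rank buckets. The name `distributed_dedup` is hypothetical, and a real implementation would use on-device sorting and collective communication instead of Python lists.

```python
from bisect import bisect_right


def distributed_dedup(per_rank: list[list[int]]) -> list[list[int]]:
    """Simulate sort-based regular-sampling de-duplication across ranks.

    Assumption: per_rank[r] holds the configuration keys owned by rank r.
    Returns one sorted, duplicate-free list per rank; no key appears on
    more than one rank.
    """
    p = len(per_rank)

    # Step 1: each rank sorts and de-duplicates its own keys.
    local = [sorted(set(chunk)) for chunk in per_rank]

    # Step 2: each non-empty rank contributes p-1 regularly spaced samples.
    samples = []
    for chunk in local:
        if chunk:
            for k in range(1, p):
                samples.append(chunk[k * len(chunk) // p])
    samples.sort()

    # Step 3: pick p-1 global splitters from the sorted samples.
    splitters = (
        [samples[k * len(samples) // p] for k in range(1, p)] if samples else []
    )

    # Step 4: route every key to the rank that owns its splitter interval
    # (this loop stands in for the all-to-all exchange).
    buckets: list[list[int]] = [[] for _ in range(p)]
    for chunk in local:
        for key in chunk:
            buckets[bisect_right(splitters, key)].append(key)

    # Step 5: equal keys always land on the same rank, so a local
    # de-duplication per rank now yields global uniqueness.
    return [sorted(set(bucket)) for bucket in buckets]


# Example with three simulated ranks holding overlapping keys.
unique_per_rank = distributed_dedup([[5, 3, 9, 3], [9, 1, 7], [2, 5, 8, 1]])
```

Because every rank uses the same splitters, duplicates of a key meet on a single owner rank, so no central process ever has to see the full key set. Sampling from locally sorted data is what keeps the resulting partitions roughly balanced.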
Topics
- Neural Network Quantum States
- Selected Configuration Interaction
- GPU Acceleration
- Distributed De-duplication
- CUDA Kernels
Best for: AI Scientist, Research Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.