cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Physical Sciences & Chemistry, Mathematics & Computational Sciences, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

cuNNQS-SCI is a fully GPU-accelerated framework designed to overcome scalability limitations in the Neural Network Quantum State-Selected Configuration Interaction (NNQS-SCI) method, a state-of-the-art technique for solving the Schrödinger equation. The original NNQS-SCI suffered from bottlenecks due to its hybrid CPU-GPU architecture, specifically centralized CPU-based global de-duplication and host-resident coupled-configuration generation. cuNNQS-SCI integrates a distributed, load-balanced global de-duplication algorithm, employs specialized CUDA kernels for exact coupled configuration generation, and incorporates a GPU memory-centric runtime with pooling, streaming mini-batches, and overlapped offloading. This design enables larger configuration spaces and shifts the performance bottleneck from host-side limitations to on-device inference. Evaluated on an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32x end-to-end speedup over the NNQS-SCI baseline while maintaining chemical accuracy and demonstrating over 90% parallel efficiency in strong scaling tests.

Key takeaway

For AI Engineers and Research Scientists working on large-scale quantum chemistry simulations, cuNNQS-SCI demonstrates that fully migrating complex workflows to GPUs, coupled with intelligent memory management and distributed algorithms, can yield substantial performance gains and enable previously intractable problem sizes. You should consider adopting similar GPU-centric architectural redesigns for non-AI components in your AI-for-Science applications to overcome CPU-side bottlenecks and maximize accelerator utilization.

Key insights

cuNNQS-SCI, a fully GPU-accelerated quantum chemistry framework, improves scalability and performance by eliminating CPU-side bottlenecks, reaching up to 2.32x end-to-end speedup over the NNQS-SCI baseline on 64 NVIDIA A100 GPUs.

Principles

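Migrate the entire workflow to the GPU to remove CPU bottlenecks.
Distribute global de-duplication across GPUs, with load balancing, instead of centralizing it on the CPU.
Adopt a GPU memory-centric design (pooling, streamed mini-batches, overlapped offloading) to handle large configuration spaces.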
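
The memory-centric principle can be illustrated with a minimal host-side sketch, assuming that the results of one mini-batch are offloaded while the next mini-batch is being computed. The names `minibatches`, `stream_process`, `compute`, and `offload` are hypothetical, and a background thread stands in for the CUDA streams and host staging buffers a real implementation would use; this is not the framework's actual runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def minibatches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def stream_process(items, batch_size, compute, offload):
    """Compute batch k while the results of batch k-1 are offloaded.

    `compute` stands in for on-device work and `offload` for a
    device-to-host copy; both are hypothetical callables.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch in minibatches(items, batch_size):
            result = compute(batch)  # overlaps with the previous offload
            if pending is not None:
                pending.result()     # at most one offload in flight
            pending = pool.submit(offload, result)
        if pending is not None:
            pending.result()         # drain the final offload
```

For example, `stream_process(list(range(10)), 4, lambda b: [x * x for x in b], host_store.extend)` appends the ten squared values to a list `host_store` in order. Waiting on the previous offload before submitting the next one bounds the number of result buffers that are alive at any time.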

Method

cuNNQS-SCI uses a three-stage pipeline: (1) massively parallel coupled-configuration generation and global de-duplication, (2) batched inference and hierarchical selection, and (3) energy calculation and network optimization. All three stages execute on GPUs, with host memory used for staging.
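
The global de-duplication step in the first stage can be sketched as a sample sort with regular sampling, simulated here in a single process. This is a minimal illustration under stated assumptions, not the paper's implementation: configurations are encoded as Python integers, each inner list plays the role of one GPU rank, and the function names `regular_samples` and `distributed_dedup` are hypothetical.

```python
from bisect import bisect_right


def regular_samples(sorted_keys, p):
    """Pick p - 1 evenly spaced samples from one rank's sorted keys."""
    n = len(sorted_keys)
    return [sorted_keys[(i * n) // p] for i in range(1, p)] if n else []


def distributed_dedup(local_keys_per_rank):
    """Return one sorted, duplicate-free bucket per simulated rank."""
    p = len(local_keys_per_rank)
    # Step 1: each rank sorts and de-duplicates its own keys.
    local_sorted = [sorted(set(keys)) for keys in local_keys_per_rank]
    # Step 2: gather the regular samples and pick p - 1 global splitters.
    samples = sorted(s for keys in local_sorted
                     for s in regular_samples(keys, p))
    m = len(samples)
    splitters = [samples[(i * m) // p] for i in range(1, p)] if m else []
    # Step 3: route every key to the rank that owns its key range
    # (this loop stands in for an all-to-all exchange).
    buckets = [[] for _ in range(p)]
    for keys in local_sorted:
        for k in keys:
            buckets[bisect_right(splitters, k)].append(k)
    # Step 4: equal keys now share an owner, so a local pass removes
    # the remaining cross-rank duplicates.
    return [sorted(set(b)) for b in buckets]
```

Because the splitters come from evenly spaced samples of every rank's data, the buckets tend to be of similar size, which is the load-balancing property this kind of scheme is chosen for. Concatenating the returned buckets in rank order gives the globally sorted set of unique configurations.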

In practice

Use sort-based regular sampling for distributed de-duplication.
Design fine-grained CUDA kernels for the bitwise operations behind coupled-configuration generation.
Process data in mini-batches with asynchronous offloading.
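
The bitwise operations behind coupled-configuration generation can be sketched as follows, assuming each configuration is an occupation bitstring over spin orbitals and that the coupled configurations are its single and double excitations. The function names are hypothetical, and spin or symmetry filtering and matrix-element screening are deliberately left out; a CUDA kernel would plausibly assign one excitation per thread, whereas this sketch simply loops.

```python
from itertools import combinations


def occupied_and_virtual(det, n_orb):
    """Split orbital indices into occupied (bit set) and virtual (bit clear)."""
    occ = [i for i in range(n_orb) if (det >> i) & 1]
    virt = [i for i in range(n_orb) if not (det >> i) & 1]
    return occ, virt


def coupled_configurations(det, n_orb):
    """Enumerate all single and double excitations of `det`."""
    occ, virt = occupied_and_virtual(det, n_orb)
    out = []
    # Singles: clear one occupied bit and set one virtual bit with XOR.
    for i in occ:
        for a in virt:
            out.append(det ^ (1 << i) ^ (1 << a))
    # Doubles: flip two occupied bits and two virtual bits.
    for i, j in combinations(occ, 2):
        for a, b in combinations(virt, 2):
            out.append(det ^ (1 << i) ^ (1 << j) ^ (1 << a) ^ (1 << b))
    return out
```

Each excitation is a handful of shifts and XORs on an integer, which is why this step maps well onto fine-grained GPU threads. Different parent configurations can produce the same excited configuration, which is what makes the global de-duplication step necessary.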

Topics

Neural Network Quantum States
Selected Configuration Interaction
GPU Acceleration
Distributed De-duplication
CUDA Kernels

Best for: AI Scientist, Research Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.