Designing GPU-Accelerated Query Engines with NVIDIA GQE

2026-06-30 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

NVIDIA GQE is a reference architecture for accelerating SQL query execution on modern NVIDIA hardware, and it targets memory and I/O bandwidth constraints. It builds on high-bandwidth memory (HBM), NVLink-C2C, and the Blackwell Decompression Engine. GQE integrates NVIDIA cuDF, CCCL, nvCOMP, and NVSHMEM, and processes Substrait plans through its query, data, and execution layers. Key optimizations include a GPU-friendly in-memory data layout, pipelined host-to-device transfers, hybrid compression using nvCOMP's Cascaded and LZ4 algorithms, and partition pruning via zone maps. In a TPC-H SF1000 benchmark, GQE on a single NVIDIA GB200 GPU was compared against DuckDB 1.4.1 on dual-socket AMD EPYC 9755 (Turin) CPUs: GQE achieved a 7.5x aggregate speedup and outperformed DuckDB on 20 of 22 queries, with gains of up to 25.5x.

Key takeaway

For data engineers optimizing analytical data platforms, NVIDIA GQE shows how targeted GPU optimizations can yield large performance improvements. Explore GQE's open-source reference architecture and its design principles, including hybrid compression and partition pruning, to minimize data transfer and maximize GPU utilization. In the reported TPC-H SF1000 benchmark, this approach delivered a 7.5x aggregate speedup over DuckDB, with individual queries running up to 25.5x faster.

Key insights

GPU-accelerated query engines gain most of their speedup by reducing and overlapping data movement: compressing data, pruning partitions before transfer, and pipelining transfers with execution on NVIDIA hardware.

Principles

Overlap data transfer and computation.
Utilize dedicated decompression engines.
Prune irrelevant data pre-transfer.
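
To make the pruning principle concrete, here is a minimal CPU-only sketch of zone-map pruning. It is not GQE's implementation: the names ZoneMapEntry, build_zone_map, and prune_partitions are hypothetical, and the sketch assumes a single integer column with a closed range predicate. The idea is to keep one (min, max) pair per partition and skip any partition whose range cannot overlap the predicate, so that it is never transferred.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ZoneMapEntry:
    """Min/max summary of one partition (hypothetical structure)."""
    partition_id: int
    min_value: int
    max_value: int


def build_zone_map(partitions: List[List[int]]) -> List[ZoneMapEntry]:
    """Compute one min/max entry per non-empty partition."""
    return [
        ZoneMapEntry(i, min(p), max(p))
        for i, p in enumerate(partitions)
        if p
    ]


def prune_partitions(zone_map: List[ZoneMapEntry], lo: int, hi: int) -> List[int]:
    """Return ids of partitions whose [min, max] range overlaps [lo, hi]."""
    return [
        z.partition_id
        for z in zone_map
        if z.max_value >= lo and z.min_value <= hi
    ]


# Four partitions of a sorted column; the predicate is 12 <= x <= 22.
partitions = [[1, 5, 9], [10, 14, 19], [20, 25, 29], [30, 31, 39]]
zone_map = build_zone_map(partitions)
survivors = prune_partitions(zone_map, 12, 22)
```

Only the surviving partitions would need to be moved to the GPU. Pruning is conservative: a surviving partition may still contain no matching rows, so the filter itself must still run on the transferred data.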

Method

GQE parses SQL into a Substrait plan, generates a task graph for GPU execution, and orchestrates pipelined data transfers combined with hybrid compression and partition pruning.
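
The pipelining idea can be illustrated with a small CPU-only analogy, sketched under loud assumptions: a background thread stands in for the host-to-device copy engine, a bounded queue stands in for a fixed number of staging buffers, and Python's sum stands in for a GPU kernel. The function name pipelined_sum and its depth parameter are hypothetical and do not come from GQE.

```python
import queue
import threading


def pipelined_sum(chunks, depth=2):
    """Sum all values while a background thread stages upcoming chunks."""
    # Bounded queue: at most `depth` staged chunks wait here at once,
    # similar in spirit to double buffering.
    staged = queue.Queue(maxsize=depth)
    sentinel = object()

    def transfer():
        for chunk in chunks:
            staged.put(list(chunk))  # stand-in for a host-to-device copy
        staged.put(sentinel)         # signal that no more chunks follow

    worker = threading.Thread(target=transfer)
    worker.start()

    total = 0
    while True:
        chunk = staged.get()
        if chunk is sentinel:
            break
        total += sum(chunk)          # stand-in for a GPU kernel
    worker.join()
    return total


result = pipelined_sum([[1, 2, 3], [4, 5], [6]])
```

While the consumer works on one chunk, the producer is already staging the next, so transfer time is hidden behind compute time whenever the two are of similar length. On a GPU the same structure would typically be expressed with asynchronous copies and streams rather than threads.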

In practice

Implement hybrid compression with LZ4 and Cascaded.
Store zone maps in GPU memory for pruning.
Use cudaMemcpyBatchAsync for batched transfers.
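
As a rough illustration of why a hybrid scheme helps, the sketch below picks a codec per column by compressed size. It makes several assumptions that are not from the source: the "cascaded-like" codec is a simplified delta plus run-length encoding (nvCOMP's Cascaded also includes bit-packing), zlib from the Python standard library stands in for LZ4, and the function names are hypothetical. It does not call nvCOMP and does not describe how GQE selects codecs.

```python
import struct
import zlib


def delta_rle_encode(values):
    """Delta-encode, then run-length encode the deltas as (delta, run) pairs."""
    runs = []
    prev = 0
    for v in values:
        d = v - prev
        prev = v
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    # Each run is stored as an int64 delta and a uint32 run length.
    return b"".join(struct.pack("<qI", d, n) for d, n in runs)


def delta_rle_decode(blob):
    """Invert delta_rle_encode."""
    if not blob:
        return []
    values, prev = [], 0
    for d, n in struct.iter_unpack("<qI", blob):
        for _ in range(n):
            prev += d
            values.append(prev)
    return values


def compress_column(values):
    """Try both codecs on an int64 column and keep the smaller output."""
    raw = struct.pack(f"<{len(values)}q", *values)
    candidates = {
        "cascaded_like": delta_rle_encode(values),
        "general_purpose": zlib.compress(raw),  # stand-in for LZ4
    }
    name = min(candidates, key=lambda k: len(candidates[k]))
    return name, candidates[name]


# A sorted key column has constant deltas; an irregular repeating
# pattern has no delta runs but plenty of byte-level repetition.
sorted_keys = list(range(1000, 2000))
patterned = [7, 1000003, 42, 99991] * 250

codec_sorted, blob_sorted = compress_column(sorted_keys)
codec_patterned, blob_patterned = compress_column(patterned)
```

In this sketch the sorted column favors the cascaded-like codec and the patterned column favors the general-purpose one, which is the motivation for choosing per column rather than using one algorithm everywhere.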

Topics

GPU Query Engines
NVIDIA GQE
Data Transfer Optimization
Hybrid Compression
Partition Pruning
TPC-H Benchmark

Best for: AI Engineer, Machine Learning Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.