Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

2026-03-16 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

NVIDIA has introduced Groq 3 LPX, a rack-scale inference accelerator for the NVIDIA Vera Rubin platform that targets low-latency, large-context agentic systems. LPX is co-designed with the NVIDIA Vera Rubin NVL72: LPX optimizes for fast, predictable token generation, while the NVL72 handles general-purpose training and high-throughput inference, including long-context processing. An LPX system combines 256 interconnected NVIDIA Groq 3 LPU accelerators and emphasizes deterministic execution, high on-chip SRAM bandwidth (150 TB/s), and tightly coordinated scale-up communication. The LPX system, integrated with the NVIDIA MGX ETL rack architecture, delivers up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models compared to prior systems. The system is managed by NVIDIA Dynamo software, which orchestrates request routing and disaggregated decode across the heterogeneous backends.

Key takeaway

For MLOps engineers building interactive AI applications or agentic systems, a heterogeneous inference architecture such as NVIDIA's Vera Rubin NVL72 paired with Groq 3 LPX is worth adopting. It lets you achieve both high AI factory throughput and low-latency responsiveness, enabling real-time collaboration and multi-agent workflows that were previously infeasible. Optimize for a range of real-world operating points rather than a single headline metric to maximize system value and user experience.

Key insights

Heterogeneous AI inference architectures combining GPUs and LPUs significantly boost both throughput and low-latency responsiveness.

Principles

Deterministic execution reduces latency jitter.
Explicit data movement optimizes memory access.
Disaggregated decode improves interactive responsiveness.
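
The third principle can be pictured as a per-phase placement decision. The sketch below is a minimal illustration of that idea, using the GPU/LPX split described in this summary (Rubin GPUs for prefill and decode attention, LPX for latency-sensitive decode loops). Every name in it (`Phase`, `Backend`, `WorkItem`, `route`) is hypothetical and chosen for illustration; it is not NVIDIA Dynamo's API, and a real router would also weigh load, cache locality, and cost.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"
    DECODE_ATTENTION = "decode_attention"
    DECODE_LOOP = "decode_loop"


class Backend(Enum):
    GPU = "rubin_gpu"  # hypothetical label for the NVL72 side
    LPU = "lpx"        # hypothetical label for the LPX side


@dataclass(frozen=True)
class WorkItem:
    request_id: str
    phase: Phase
    latency_sensitive: bool = True


def route(item: WorkItem) -> Backend:
    """Pick a backend for one phase of a request.

    Assumed policy: only latency-sensitive decode loops go to the LPU side;
    prefill, decode attention, and throughput-oriented work stay on GPUs.
    """
    if item.phase is Phase.DECODE_LOOP and item.latency_sensitive:
        return Backend.LPU
    return Backend.GPU


if __name__ == "__main__":
    for phase in Phase:
        print(phase.value, "->", route(WorkItem("req-1", phase)).value)
```

Keeping the policy in one pure function makes it easy to test each operating point separately, which fits the advice above to optimize across a range of workloads rather than one headline metric.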

Method

The NVIDIA Groq 3 LPU architecture tightly couples compute, memory, and communication under compiler control, using 320-byte vectors for operations and a flat, SRAM-first memory (500 MB on-chip SRAM) for primary working storage.
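
To put these figures in perspective, the short calculation below works out how many elements fit in one 320-byte vector and how much SRAM an LPX system holds in aggregate. It rests on assumptions that the summary does not confirm: that the 500 MB figure is per LPU, that decimal units apply (1 GB = 1000 MB), and that the listed element widths are relevant data types.

```python
VECTOR_BYTES = 320      # vector width stated in the summary
SRAM_MB_PER_LPU = 500   # on-chip SRAM, read here as a per-LPU figure (assumption)
LPUS_PER_LPX = 256      # LPUs per LPX system stated in the summary

# Assumed element widths in bytes; the summary does not name the data types used.
ELEMENT_BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}


def elements_per_vector(dtype: str) -> int:
    """Number of elements of the given type that fit in one 320-byte vector."""
    return VECTOR_BYTES // ELEMENT_BYTES[dtype]


def aggregate_sram_gb() -> float:
    """Total on-chip SRAM across one LPX system, in decimal gigabytes."""
    return SRAM_MB_PER_LPU * LPUS_PER_LPX / 1000


if __name__ == "__main__":
    for dtype in ELEMENT_BYTES:
        print(dtype, elements_per_vector(dtype))
    print("aggregate SRAM (GB):", aggregate_sram_gb())
```

Under these assumptions a vector holds 160 two-byte elements, and the 256 LPUs together provide 128 GB of SRAM. That aggregate capacity is what makes an SRAM-first design plausible for the decode working set.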

In practice

Use LPX for latency-sensitive decode loops.
Employ Rubin GPUs for prefill and decode attention.
Implement speculative decoding with LPX for draft generation.
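
The speculative decoding recommendation can be illustrated with a minimal greedy sketch: a fast draft model proposes a few tokens, and the larger target model verifies them and keeps the longest agreeing prefix. Here `draft_next` stands in for the low-latency draft path and `target_next` for the verifier. Both are plain Python callables invented for illustration, not a real hardware or NVIDIA API. A production system would verify all proposed positions in one batched pass and typically uses probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. All k proposals matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new_tokens]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy "large model": count upward modulo 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10  # wrong after token 4


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, max_new_tokens=8, k=3))
```

Because every accepted token is either confirmed or supplied by the target model, the output is identical to what the target would produce alone. The draft only changes how many target rounds are needed, which is why a fast, predictable draft path matters for interactive latency.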

Topics

NVIDIA Groq 3 LPX
Vera Rubin Platform
Agentic AI
Heterogeneous Inference
Speculative Decoding

Best for: MLOps Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.