Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri chimeh, NVIDIA

· Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

NVIDIA's Mozhgan Kabiri chimeh presents practical LLM performance data on the DGX Spark, a local AI development system powered by the GB10 Grace Blackwell superchip. The DGX Spark features 128 GB of unified memory and NVFP4 support, enabling local operation of models up to 200 billion parameters using the same NVIDIA AI software stack as production environments. The experiments ran quantized models on vLLM in an NVIDIA-optimized container and used an automated benchmarking harness to measure completion tokens per second and time to first token. Key findings: the 14-billion-parameter NVFP4 model reached 20.19 tokens per second, about 2.4 times the 14-billion-parameter base model's 8.40 tokens per second. NVFP4 quantization also made the 14-billion-parameter model 3.4 times faster to first token than the base model, highlighting the critical role of quantization in bridging memory capacity and bandwidth limitations.
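The capacity and throughput figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function name is hypothetical, the base model is assumed to use 16-bit weights (the summary does not state its precision), and NVFP4 is treated as exactly 4 bits per weight, ignoring scale-factor overhead, KV cache, and activations.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper: counts weights only, not KV cache, activations,
    or the per-block scale factors that a real 4-bit format also stores.
    """
    return params_billion * bits_per_weight / 8


# Assumption: the 14B base model stores weights in a 16-bit format.
base_14b_gb = weight_footprint_gb(14, 16)   # 28.0 GB
nvfp4_14b_gb = weight_footprint_gb(14, 4)   # 7.0 GB

# A 200B-parameter model at 4 bits per weight, against 128 GB of unified memory.
nvfp4_200b_gb = weight_footprint_gb(200, 4)  # 100.0 GB

# Throughput ratio from the figures reported in the summary.
throughput_ratio = round(20.19 / 8.40, 2)    # 2.4
```

Under these assumptions, a 200-billion-parameter model needs roughly 100 GB for weights at 4 bits, which is why it can fit in 128 GB of unified memory only when quantized. The smaller footprint also means fewer bytes read from memory per generated token, which is consistent with the reported speedup, although the measured 2.4 times gain is below the 4 times reduction in weight bytes.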

Key takeaway

For AI engineers and MLOps teams evaluating local LLM deployment, the DGX Spark is well suited to rapid prototyping and to work with privacy-sensitive data. Its ability to run large, quantized models locally on the same software stack used in production reduces cloud dependencies and shortens iteration cycles. Consider NVFP4 quantization to maximize throughput and responsiveness, especially for models around 14 billion parameters, to achieve performance comparable to smaller models.

Key insights

Local LLM performance on DGX Spark benefits significantly from 4-bit (NVFP4) quantization, in both throughput and responsiveness.

Method

The benchmarks ran through an automated harness that used Docker isolation, warm-up runs, and GPU metrics logging at 1-second intervals, so that the LLM performance measurements are reproducible and verifiable.
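The measurement part of such a harness can be sketched as follows. This is a minimal illustration, not the harness used in the talk: `measure_stream` and `run_benchmark` are hypothetical names, the token stream is any Python iterable standing in for a streaming completion, and tokens per second is assumed to mean completion tokens divided by total wall time including time to first token. Docker isolation and GPU metrics logging are outside its scope.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def measure_stream(stream: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed completion: time to first token and tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "completion_tokens": count,
        # Assumption: throughput is computed over total wall time.
        "tokens_per_s": count / total if total > 0 else float("inf"),
    }


def run_benchmark(make_stream: Callable[[], Iterable[str]],
                  warmup: int = 2,
                  runs: int = 5,
                  clock: Callable[[], float] = time.perf_counter) -> dict:
    """Discard warm-up runs, then average the metrics over the measured runs."""
    for _ in range(warmup):
        for _ in make_stream():  # consume and ignore warm-up output
            pass
    results = [measure_stream(make_stream(), clock) for _ in range(runs)]
    return {
        "mean_ttft_s": mean(r["ttft_s"] for r in results),
        "mean_tokens_per_s": mean(r["tokens_per_s"] for r in results),
        "runs": results,
    }
```

In use, `make_stream` would wrap whatever client call yields tokens from the locally served model. The warm-up runs are discarded so that one-time costs such as model loading or cache population do not skew the averages.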

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.