Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri chimeh, NVIDIA
Summary
Nvidia's Moska Gabricima presents practical LLM performance data on the Jetson Spark, a local AI development system powered by the GB10 Grace Blackwell superchip. The Jetson Spark features 128 GB of unified memory and NV4 support, enabling local operation of models up to 200 billion parameters using the same Nvidia AI software stack as production environments. Experiments utilized vLLM with quantized models in an Nvidia-optimized container, employing an automated benchmarking harness to measure completion tokens per second and time to first token. Key findings include the 14 billion NVFB4 model achieving 20.19 tokens per second, significantly outperforming the 14 billion base model's 8.40 tokens per second. The NVFB4 quantization also made the 14 billion model 3.4 times faster to first token than its unoptimized counterpart, highlighting the critical role of quantization in bridging memory capacity and bandwidth limitations.
Key takeaway
For AI Engineers and MLOps teams evaluating local LLM deployment, the Jetson Spark offers a powerful solution for rapid prototyping and privacy-sensitive data. Its ability to run large, quantized models locally with a consistent software stack minimizes cloud dependencies and accelerates iteration cycles. Consider leveraging NVFB4 quantization to maximize throughput and responsiveness, especially for models around 14 billion parameters, to achieve performance comparable to smaller models.
Key insights
Local LLM performance on Jetson Spark benefits significantly from 4-bit quantization for throughput and responsiveness.
Principles
- Quantization is critical for LLM performance on Blackwell hardware.
- Time to first token defines user-perceived application responsiveness.
Method
An automated benchmarking harness was used, involving Docker isolation, warm-up runs, and 1-second interval GPU metrics logging to ensure reproducible and verifiable performance measurements for LLMs.
In practice
- Use vLLM with quantized models for local LLM serving.
- Employ Docker for environment isolation in benchmarks.
Topics
- Jetson Spark
- LLM Performance
- NVFB4 Quantization
- vLLM
- Grace Blackwell Superchip
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.