NVIDIA DGX Spark performance

· Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, quick

Summary

NVIDIA DGX Spark performance tests were run on release-day firmware (580.95.05) with Ollama v0.12.6 to evaluate large language model inference. The tests covered gpt-oss, gemma3, llama3.1, deepseek-r1, and qwen3 at several quantizations (MXFP4, q4_K_M, q8_0). Each test was repeated 10 times with output capped at 500 tokens, temperature set to 0, and caching disabled. The prompt asked for an in-depth summary of "A Tale of Two Cities." The reported metrics were prefill and decode throughput in tokens per second. For example, the 20B gpt-oss model with MXFP4 quantization reached 3,224 prefill tokens per second and 58.27 decode tokens per second. The 120B gpt-oss model also fit entirely into the DGX Spark's 128GB of unified memory.
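To put the throughput figures in wall-clock terms, here is a minimal sketch that converts tokens per second into seconds per request. The rates are the gpt-oss 20B numbers quoted above; the prompt length is a hypothetical placeholder, since the summary does not state how many tokens the prompt contained.

```python
def decode_seconds(output_tokens: int, decode_tok_per_s: float) -> float:
    """Time spent generating output_tokens at a steady decode rate."""
    return output_tokens / decode_tok_per_s


def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tok_per_s


# Figures quoted in the summary for gpt-oss 20B with MXFP4.
PREFILL_RATE = 3224.0   # tokens per second
DECODE_RATE = 58.27     # tokens per second
OUTPUT_TOKENS = 500     # output cap used in the tests

# Hypothetical prompt length: an assumption, not a value from the source.
ASSUMED_PROMPT_TOKENS = 1000

if __name__ == "__main__":
    print(f"decode:  {decode_seconds(OUTPUT_TOKENS, DECODE_RATE):.2f} s")
    print(f"prefill: {prefill_seconds(ASSUMED_PROMPT_TOKENS, PREFILL_RATE):.2f} s")
```

At these rates, decoding dominates: the 500-token output takes roughly 8.6 seconds, while a prompt of the assumed length would be processed in well under a second.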

Key takeaway

If you are an NLP engineer evaluating LLM deployment on NVIDIA DGX Spark, favor MXFP4 or q4_K_M quantization for decode performance, especially with larger models. Weigh model size against tokens per second for your specific application. Run firmware 580.95.05 or newer so that your setup is at least as current as the tested configuration.

Key insights

Ollama on NVIDIA DGX Spark delivers strong LLM inference performance across various models and quantizations.

Method

Each test was repeated 10 times with output capped at 500 tokens, temperature set to 0, and a fixed text summarization prompt. Caching was disabled so that results stayed consistent across runs.
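The method above could be approximated with a short script against Ollama's local REST API. This is a sketch under stated assumptions, not the harness used for the published numbers: it relies on the `/api/generate` endpoint and the timing fields it returns (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds), and it averages the runs with a plain mean, which the source does not specify. It also does not disable caching, because the source does not say how that was done.

```python
import json
import statistics
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Request body mirroring the test settings: temperature 0, 500-token cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 500},
    }


def rates(resp):
    """Return (prefill, decode) tokens per second from one response."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    decode = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill, decode


def run_once(model, prompt):
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


def benchmark(model, prompt, repeats=10, run=run_once):
    """Mean prefill and decode rates over `repeats` runs (averaging is an assumption)."""
    samples = [rates(run(model, prompt)) for _ in range(repeats)]
    return (
        statistics.mean(s[0] for s in samples),
        statistics.mean(s[1] for s in samples),
    )
```

With a server running, a call such as `benchmark("gpt-oss:20b", "Write an in-depth summary of A Tale of Two Cities.")` would return the two averages; the model tag and prompt wording here are illustrative. The `run` parameter exists so the timing arithmetic can be checked with a stubbed response instead of a live server.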

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.