NVIDIA DGX Spark performance
Summary
NVIDIA DGX Spark performance tests were conducted using release-day firmware (580.95.05) and Ollama v0.12.6 to evaluate large language model inference. The tests covered gpt-oss, gemma3, llama3.1, deepseek-r1, and qwen3 models at different quantizations (MXFP4, q4_K_M, q8_0). Each test was repeated 10 times, with output limited to 500 tokens, temperature set to 0, and caching disabled. The prompt asked for an in-depth summary of "A Tale of Two Cities." The reported metrics were prefill and decode throughput in tokens per second. For example, the 20B gpt-oss model with MXFP4 quantization achieved 3,224 prefill tokens/second and 58.27 decode tokens/second. The 120B gpt-oss model fit entirely into the DGX Spark's 120GB VRAM.
Key takeaway
NLP engineers evaluating LLM deployment on NVIDIA DGX Spark should prioritize models with MXFP4 or q4_K_M quantization for the best decode performance, especially with larger models. Weigh model size against tokens per second for your specific application. Make sure your DGX Spark firmware is at 580.95.05 or newer to benefit from the latest optimizations and stability fixes.
Key insights
Ollama on NVIDIA DGX Spark runs a range of models and quantizations at usable inference speeds: the 20B gpt-oss model with MXFP4 reached 3,224 prefill tokens/second and 58.27 decode tokens/second.
Principles
- Quantization significantly impacts LLM inference speed.
- Larger models generally yield lower tokens/second.
- Caching should be disabled for consistent performance benchmarks.
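One way to make the first two principles concrete is to estimate how many bytes of weights each format stores. The sketch below is a back-of-the-envelope calculation, not something taken from the original benchmarks. It assumes the nominal block layouts of q8_0 (32 8-bit values plus one 16-bit scale) and MXFP4 (32 4-bit values plus one 8-bit scale), treats the parameter counts in model names as exact, and ignores the KV cache, activations, and any tensors a real model file keeps at higher precision. The function name `weight_footprint_gb` is hypothetical.

```python
# Nominal bits per weight, derived from each format's block layout (assumed).
Q8_0_BITS = 8 + 16 / 32    # 32 int8 values + one 16-bit scale  -> 8.5
MXFP4_BITS = 4 + 8 / 32    # 32 4-bit values + one 8-bit scale  -> 4.25


def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; real models differ somewhat.
    for name, params, bits in [
        ("20B at MXFP4", 20e9, MXFP4_BITS),
        ("120B at MXFP4", 120e9, MXFP4_BITS),
        ("8B at q8_0", 8e9, Q8_0_BITS),
    ]:
        print(f"{name}: ~{weight_footprint_gb(params, bits):.1f} GB of weights")
```

Fewer bytes per weight means less data to read for every generated token, which is one plausible reason lower-bit formats and smaller models tend to decode faster. Treat the figures as rough lower bounds rather than measured file sizes.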
Method
Each model was tested 10 times with a fixed 500-token output limit, temperature 0, and a single text summarization prompt. Caching was disabled so that every run processed the full prompt and results stayed consistent.
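The method above could be approximated with a small script against a local Ollama server. This is a minimal sketch, not the harness used for the original tests. It assumes Ollama's default local endpoint and relies on the `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` fields that Ollama returns for a non-streaming generate request, with durations in nanoseconds. The helper names and the choice to average across runs are illustrative assumptions. The source does not say how caching was disabled, so the sketch does not attempt it, and prefill figures from repeated runs may be affected as a result.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model, prompt, max_tokens=500):
    """Build a non-streaming generate request with a fixed output budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": max_tokens},
    }


def tokens_per_second(count, duration_ns):
    """Convert a token count and a nanosecond duration to tokens/second."""
    return count / (duration_ns / 1e9)


def summarize_runs(responses):
    """Average prefill and decode throughput over a list of response dicts."""
    prefill = [tokens_per_second(r["prompt_eval_count"], r["prompt_eval_duration"])
               for r in responses]
    decode = [tokens_per_second(r["eval_count"], r["eval_duration"])
              for r in responses]
    return {"prefill_tps": statistics.mean(prefill),
            "decode_tps": statistics.mean(decode)}


def run_benchmark(model, prompt, repetitions=10):
    """Send the same prompt `repetitions` times and summarize the timings."""
    responses = []
    for _ in range(repetitions):
        body = json.dumps(build_request(model, prompt)).encode("utf-8")
        req = urllib.request.Request(
            OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            responses.append(json.load(resp))
    return summarize_runs(responses)
```

With a server running and the model pulled, a call such as `run_benchmark("gpt-oss:20b", "Write an in-depth summary of A Tale of Two Cities.")` returns the mean prefill and decode rates. The prompt wording here is a guess at the original.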
In practice
- Use MXFP4 quantization for gpt-oss models on DGX Spark.
- Update DGX Spark firmware to 580.95.05 or newer.
- Integrate OpenAI's Codex with Ollama for coding tasks.
Topics
- NVIDIA DGX Spark
- LLM Performance Benchmarking
- Ollama
- Model Quantization
- OpenAI Codex
Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.