NVIDIA DGX Spark performance

2025-10-22 · Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, quick

Summary

NVIDIA DGX Spark performance tests were conducted using release-day firmware (580.95.05) and Ollama v0.12.6 to evaluate large language model inference. The tests covered models such as gpt-oss, gemma3, llama3.1, deepseek-r1, and qwen3 at several quantizations (MXFP4, q4_K_M, q8_0). Each test was repeated 10 times, with output capped at 500 tokens, temperature set to 0, and caching disabled. The prompt used was an in-depth summary of "A Tale of Two Cities." The reported metrics are prefill and decode throughput in tokens per second. For example, the 20B gpt-oss model reached 3,224 prefill tokens/second and 58.27 decode tokens/second with MXFP4 quantization. The 120B gpt-oss model fit entirely into the DGX Spark's 128GB of unified memory.

Key takeaway

NLP engineers evaluating LLM deployment on NVIDIA DGX Spark should favor MXFP4 or q4_K_M quantization for the best decode throughput, especially with larger models. Weigh model size against tokens per second for your specific application. The benchmarks were run on firmware 580.95.05, so use that version or newer when comparing your own results against them.

Key insights

Ollama on NVIDIA DGX Spark delivers strong LLM inference performance across various models and quantizations; for example, gpt-oss 20B with MXFP4 reached 3,224 prefill and 58.27 decode tokens/second.

Principles

Quantization significantly impacts LLM inference speed.
Larger models generally yield lower tokens/second.
Caching should be disabled for consistent performance benchmarks.
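
To make the link between quantization, model size, and memory concrete, the sketch below estimates the weight footprint of a model from its parameter count and an effective bits-per-weight figure. The bits-per-weight values are approximations assumed for illustration (block formats carry a small per-block scale overhead, and q4_K_M mixes precisions), the parameter counts are the nominal sizes from the model names, and the 8B model is a hypothetical example. The estimate ignores the KV cache, runtime overhead, and any tensors a model keeps at higher precision, so real memory use will be higher.

```python
# Approximate effective bits per weight, including per-block scale overhead.
# These are rough assumptions for illustration, not measured values.
BITS_PER_WEIGHT = {
    "MXFP4": 4.25,   # 4-bit values plus one 8-bit scale per 32-weight block
    "q4_K_M": 4.8,   # mixed-precision format; assumed to land a little under 5 bits
    "q8_0": 8.5,     # 8-bit values plus one 16-bit scale per 32-weight block
}


def weight_footprint_gb(num_params, quant):
    """Estimate weight storage in GB (10^9 bytes) for a quantized model."""
    return num_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; the 8B entries are a hypothetical model.
    examples = [
        ("gpt-oss 20B", 20e9, "MXFP4"),
        ("gpt-oss 120B", 120e9, "MXFP4"),
        ("hypothetical 8B", 8e9, "q4_K_M"),
        ("hypothetical 8B", 8e9, "q8_0"),
    ]
    for name, params, quant in examples:
        gb = weight_footprint_gb(params, quant)
        print(f"{name:16s} {quant:7s} ~{gb:.1f} GB of weights")
```

Because each decoded token has to read the model's active weights from memory, a smaller footprint per weight usually translates into higher decode throughput, which is consistent with the principles listed above.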

Method

Performance testing involved 10 repetitions, fixed output tokens, zero temperature, and a specific text summarization prompt, with caching disabled to ensure consistent results.
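
A minimal sketch of how a comparable measurement might be scripted against a local Ollama server follows. It is not the harness used in the original tests, which the source does not describe. It assumes Ollama's default local endpoint and uses the `/api/generate` response fields `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds). The function names, the `run` parameter, and the aggregation are hypothetical choices for illustration. The sketch does not disable caching, since the source does not say how that was done; a reused prompt cache would distort prefill numbers.

```python
import json
import statistics
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, max_tokens=500):
    """Request body mirroring the described setup: capped output, temperature 0."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens, "temperature": 0},
    }


def tokens_per_second(resp):
    """Convert Ollama's token counts and nanosecond durations into rates."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    decode = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill, decode


def run_once(model, prompt):
    """Send one non-streaming generate request and return (prefill, decode) rates."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))


def benchmark(model, prompt, runs=10, run=run_once):
    """Repeat the measurement and summarize; `run` is injectable for testing."""
    results = [run(model, prompt) for _ in range(runs)]
    prefill = [p for p, _ in results]
    decode = [d for _, d in results]
    return {
        "prefill_mean": statistics.mean(prefill),
        "decode_mean": statistics.mean(decode),
        "decode_stdev": statistics.stdev(decode),
    }
```

Calling `benchmark("gpt-oss:20b", prompt_text)` with a server running would return mean prefill and decode rates over 10 runs. Reporting the spread alongside the mean helps show whether the repetitions were actually consistent.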

In practice

Use MXFP4 quantization for gpt-oss models on DGX Spark.
Update DGX Spark firmware to 580.95.05 or newer.
Integrate OpenAI's Codex with Ollama for coding tasks.

Topics

NVIDIA DGX Spark
LLM Performance Benchmarking
Ollama
Model Quantization
OpenAI Codex

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.