The RTX 5080 Is 3x Faster Than NVIDIA’s $3,999 “AI Supercomputer” — And It Doesn’t Matter

· Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

NVIDIA's $3,999 DGX Spark, marketed as an "AI supercomputer" with a GB10 Grace Blackwell Superchip and 128GB of unified memory, was benchmarked against consumer GPUs by sending 50 concurrent requests to a GPT-OSS 20B model served through Ollama. The DGX Spark completed every request without instability, but its wall time was 18,349.6 seconds (~5.1 hours), with an average per-request latency of ~18,350 seconds and a throughput of about 50 tokens/sec. That is well behind the RTX 5090 (~213 tokens/sec) and the RTX 5080 (~155 tokens/sec), which are roughly 4.3x and 3.1x faster. The bottleneck is identified as the DGX Spark's 273 GB/s shared LPDDR5x memory bandwidth, about 3.5x lower than the RTX 5080's 960 GB/s of dedicated GDDR7. The DGX Spark's key advantage is its 128GB memory capacity, which lets it hold 70B-200B parameter models locally (the upper end only in low-precision formats such as 4-bit, since 200B parameters at 16 bits would need about 400GB for the weights alone), something consumer GPUs like the RTX 5080 (16GB) or RTX 5090 (32GB) cannot do. The DGX Spark also offers superior power efficiency (4W idle, 170W peak system draw) and ships as a complete, pre-configured system.
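
The ratios behind the headline can be reproduced from the figures quoted in this summary. The sketch below is only arithmetic on those reported numbers; the dictionary and function names are illustrative and are not taken from the original article.

```python
# Figures as quoted in the summary; nothing here is measured independently.
TOKENS_PER_SEC = {"DGX Spark": 50.0, "RTX 5080": 155.0, "RTX 5090": 213.0}
BANDWIDTH_GB_PER_SEC = {"DGX Spark": 273.0, "RTX 5080": 960.0}
WALL_TIME_SEC = 18_349.6


def ratio(table, numerator, denominator):
    """Return how many times larger one entry of `table` is than another."""
    return table[numerator] / table[denominator]


speedup_5080 = ratio(TOKENS_PER_SEC, "RTX 5080", "DGX Spark")
speedup_5090 = ratio(TOKENS_PER_SEC, "RTX 5090", "DGX Spark")
bandwidth_gap = ratio(BANDWIDTH_GB_PER_SEC, "RTX 5080", "DGX Spark")
wall_time_hours = WALL_TIME_SEC / 3600

print(f"RTX 5080 vs DGX Spark throughput: {speedup_5080:.1f}x")
print(f"RTX 5090 vs DGX Spark throughput: {speedup_5090:.1f}x")
print(f"RTX 5080 vs DGX Spark bandwidth:  {bandwidth_gap:.1f}x")
print(f"DGX Spark wall time:              {wall_time_hours:.1f} hours")
```

The throughput gap (about 3.1x) and the bandwidth gap (about 3.5x) are of similar size, which is consistent with the summary's claim that memory bandwidth is the limiting factor.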

Key takeaway

AI engineers evaluating hardware for local LLM deployment should recognize that the DGX Spark's strengths are memory capacity for larger models (70B-200B parameters) and power efficiency, which suit it to edge or on-site installations that require data privacy. If your models fit within 16-32GB of VRAM and you need high token generation speed, consumer GPUs like the RTX 5080 or 5090 remain the better choice. Match your hardware to model size and operational constraints, not just to raw inference speed claims.
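
A rough way to apply this sizing rule is to estimate the weight footprint of a model at a given precision and compare it with each device's memory. The sketch below is an assumption-laden estimate, not a method from the original article: it counts weights only (no KV cache, activations, or runtime overhead), treats 1 GB as 10^9 bytes, and uses hypothetical helper names. Usable memory on a unified-memory system will also be lower than the nominal figure because it is shared with the operating system.

```python
# Memory capacities as quoted in the summary (GB).
DEVICE_MEMORY_GB = {"RTX 5080": 16, "RTX 5090": 32, "DGX Spark": 128}


def weight_footprint_gb(params_billions, bits_per_param):
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def devices_that_fit(params_billions, bits_per_param):
    """List devices whose nominal memory can hold the weights (ignores overhead)."""
    needed = weight_footprint_gb(params_billions, bits_per_param)
    return [name for name, cap in DEVICE_MEMORY_GB.items() if needed <= cap]


for params in (20, 70, 200):
    for bits in (16, 8, 4):
        size = weight_footprint_gb(params, bits)
        fits = devices_that_fit(params, bits) or ["none"]
        print(f"{params}B @ {bits}-bit: {size:.0f} GB -> {', '.join(fits)}")
```

Under these assumptions a 70B model needs about 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits, so the larger models in the 70B-200B range fit in 128GB only at reduced precision, and exceed 16-32GB cards even at 4 bits.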

Key insights

Memory capacity, not raw inference speed, defines the DGX Spark's primary value proposition.

Method

Benchmark LLM inference performance by sending 50 concurrent requests to GPT-OSS 20B via Ollama and measuring wall time, per-request latency, and tokens/sec, then compare the results against each device's memory bandwidth specification.
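
A minimal sketch of such a benchmark is shown below. It is not the original author's script. It assumes an Ollama server on its default local port, the `gpt-oss:20b` model tag, a placeholder prompt, and that tokens/sec means total generated tokens divided by wall time; the article's exact definition is not stated. The server may also need to be configured to process requests in parallel rather than queue them.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: default local endpoint
MODEL = "gpt-oss:20b"                                # assumed model tag
PROMPT = "Explain memory bandwidth in two sentences."  # hypothetical prompt


def send_request(prompt):
    """Send one non-streaming request; return (latency_seconds, generated_tokens)."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    latency = time.perf_counter() - start
    # eval_count is the number of generated tokens reported in the response.
    return latency, body.get("eval_count", 0)


def summarize(results, wall_time):
    """Aggregate (latency, tokens) pairs into the three reported metrics."""
    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    return {
        "requests": len(results),
        "wall_time_s": wall_time,
        "avg_latency_s": sum(latencies) / len(latencies),
        "tokens_per_sec": total_tokens / wall_time,
    }


def run_benchmark(send, prompt, concurrency=50):
    """Issue `concurrency` requests at once and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send, [prompt] * concurrency))
    wall_time = time.perf_counter() - start
    return summarize(results, wall_time)
```

With a server running, `run_benchmark(send_request, PROMPT)` returns a dictionary of the three metrics. Passing the request function as a parameter lets the aggregation logic be checked with a stub before pointing it at real hardware.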

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.