UC San Diego Lab Advances Generative AI Research With NVIDIA DGX B200 System

2025-12-17 · Source: NVIDIA Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, short

Summary

The Hao AI Lab at the University of California San Diego has received an NVIDIA DGX B200 system to advance its research on large language model (LLM) inference. The system, one of NVIDIA's most powerful AI platforms, is accelerating projects such as FastVideo, which trains video generation models to produce a five-second video from a text prompt in five seconds, and Lmgame-bench, a benchmarking suite that evaluates LLMs on popular online games. The lab also works on low-latency LLM serving. DistServe, a key concept from the Hao AI Lab, influenced disaggregated inference, a method now used in platforms such as NVIDIA Dynamo. DistServe also introduced "goodput", a metric that measures throughput achieved while satisfying user-specified latency objectives, which makes it a more complete performance indicator than throughput alone.

Key takeaway

For AI scientists optimizing LLM inference, disaggregated inference is worth understanding and implementing. Running prefill and decode on separate GPUs keeps the two phases from competing for the same resources, which improves "goodput", a metric that balances high throughput with low user-perceived latency. This approach, used in NVIDIA Dynamo, lets operators scale workloads continuously without compromising responsiveness or the quality of model responses, which directly affects the efficiency and cost of LLM deployments.

Key insights

Disaggregated inference separates prefill from decode to optimize "goodput", improving both LLM serving efficiency and user experience.

Principles

Goodput balances throughput and latency.
Disaggregation removes resource contention between prefill and decode.
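
To make the goodput principle concrete, here is a minimal sketch that compares plain throughput with goodput for a batch of request traces. It is a simplification rather than DistServe's exact definition: it counts only the requests that met every latency objective and divides by the measurement window. The two latency objectives used here (time to first token and time per output token), the names RequestTrace, throughput, and goodput, and all the numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Latency measurements for one served request (hypothetical record)."""
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # average time per output token, in seconds


def throughput(traces, duration_s):
    """Completed requests per second, regardless of latency."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(traces) / duration_s


def goodput(traces, duration_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that met BOTH latency objectives."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    within_slo = sum(
        1
        for t in traces
        if t.ttft_s <= ttft_slo_s and t.tpot_s <= tpot_slo_s
    )
    return within_slo / duration_s


# Four requests completed in a 2-second window (made-up numbers).
traces = [
    RequestTrace(ttft_s=0.20, tpot_s=0.03),  # meets both objectives
    RequestTrace(ttft_s=0.40, tpot_s=0.04),  # meets both objectives
    RequestTrace(ttft_s=0.90, tpot_s=0.03),  # first token arrived too late
    RequestTrace(ttft_s=0.30, tpot_s=0.09),  # tokens streamed too slowly
]

raw = throughput(traces, duration_s=2.0)
good = goodput(traces, duration_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(f"throughput={raw} req/s, goodput={good} req/s")
```

In this example half of the completed requests missed a latency objective, so the system looks twice as productive by throughput as it does by goodput.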

Method

Split LLM inference into prefill (compute-intensive) and decode (memory-intensive) stages, running them on different GPUs to maximize "goodput" and reduce latency.
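
The split described above can be sketched as a toy, single-process simulation. It is not how DistServe or NVIDIA Dynamo is implemented: the two worker classes merely stand in for separate GPUs, the Python list standing in for the KV cache is handed over directly instead of being transferred between devices, and toy_next_token is a made-up arithmetic rule in place of a real model. All names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


def toy_next_token(kv_cache):
    """Placeholder for a model forward pass (arbitrary arithmetic rule)."""
    return (sum(kv_cache) + len(kv_cache)) % 1000


@dataclass
class Request:
    request_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)
    output_tokens: list = field(default_factory=list)


class PrefillWorker:
    """Stands in for a GPU that only processes prompts."""

    def run(self, req):
        # One pass over the whole prompt builds the cache and yields token 1.
        req.kv_cache = list(req.prompt_tokens)
        req.output_tokens.append(toy_next_token(req.kv_cache))
        return req


class DecodeWorker:
    """Stands in for a GPU that only generates tokens one step at a time."""

    def __init__(self):
        self.active = []

    def admit(self, req):
        # In a real system the KV cache would be transferred between GPUs here.
        self.active.append(req)

    def step(self):
        """Generate one token for every active request; return finished ones."""
        finished, still_running = [], []
        for req in self.active:
            req.kv_cache.append(req.output_tokens[-1])
            req.output_tokens.append(toy_next_token(req.kv_cache))
            if len(req.output_tokens) >= req.max_new_tokens:
                finished.append(req)
            else:
                still_running.append(req)
        self.active = still_running
        return finished


def serve(requests):
    """Route each request through prefill first, then hand it to decode."""
    prefill_queue = deque(requests)
    prefill, decode = PrefillWorker(), DecodeWorker()
    done = []
    while prefill_queue or decode.active:
        if prefill_queue:
            req = prefill.run(prefill_queue.popleft())
            if len(req.output_tokens) >= req.max_new_tokens:
                done.append(req)
            else:
                decode.admit(req)
        done.extend(decode.step())
    return done


results = serve([
    Request(request_id=0, prompt_tokens=[1, 2, 3], max_new_tokens=3),
    Request(request_id=1, prompt_tokens=[4, 5], max_new_tokens=3),
])
for r in results:
    print(r.request_id, r.output_tokens)
```

The point of the structure is that the decode worker never runs a prompt pass, so a long incoming prompt cannot stall token generation for requests that are already streaming. This loop runs the two workers in turn for simplicity, whereas separate GPUs would run them concurrently.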

In practice

Use an NVIDIA DGX B200 system for rapid prototyping.
Employ disaggregated inference for LLM serving.
Benchmark LLMs with the Lmgame-bench suite.

Topics

Large Language Model Inference
NVIDIA DGX B200
LLM Serving Optimization
DistServe
Video Generation Models

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Blog.