DynoSim: Simulating the Pareto Frontier

2026-05-29 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

DynoSim is a workload-driven discrete-event simulator for the NVIDIA Dynamo LLM serving stack. It is built to make it tractable to tune deployment choices that interact with one another. The Rust-based tool combines measured engine forward-pass timings, Mocker scheduler cores, Router and Planner behavior, KV cache effects, and workload traces on a single virtual timeline. It is both faithful and fast: it simulates a 60.1-minute serving window in 2.41 seconds, roughly 1,500x faster than real time. With DynoSim, teams can map Pareto frontiers for specific workloads and evaluate algorithmic changes to components such as Router cost functions or cache policies. Its architecture composes three layers: workload replay; single-engine simulation with scheduler fidelity (for example, vLLM and SGLang paths); and multi-engine simulation for system-level behaviors such as routing and distributed caching. Experiments show its value in two areas. For tuning Planner settings, the best scaling interval falls around 5 to 10 seconds. For quantifying the impact of cold-start times, it reveals an SLA cliff at a cold-start time of roughly 180 seconds for Qwen3-32B at TP=2.
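As a quick check on the quoted speedup, convert the serving window to seconds and divide it by the simulation's wall-clock time:

```latex
\text{speedup} = \frac{60.1\ \text{min} \times 60\ \text{s/min}}{2.41\ \text{s}} = \frac{3606\ \text{s}}{2.41\ \text{s}} \approx 1496 \approx 1{,}500\times
```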

Key takeaway

For MLOps engineers optimizing LLM serving deployments, DynoSim offers a fast way to explore large configuration spaces. Integrate the simulator into your workflow to screen thousands of deployment candidates and algorithmic changes, such as Router cost functions or cache policies, before committing GPU time. This lets you identify effective autoscaling intervals (for example, 5 to 10 seconds) and understand how cold-start time affects SLA adherence (for example, the SLA cliff at roughly 180 seconds). The result is less GPU time spent on validation and faster performance improvements.

Key insights

DynoSim provides a fast, high-fidelity discrete-event simulation for the NVIDIA Dynamo LLM serving stack, enabling rapid optimization and discovery of complex deployment configurations.

Principles

Discrete-event simulation models complex system interactions effectively.
Scheduler fidelity significantly impacts Time-to-First-Token (TTFT) accuracy.
KV-aware routing enhances prefix reuse and throughput in LLM serving.
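KV-aware routing trades prefix reuse against load. The Python sketch below shows one simple way a Router cost function can express that trade-off. It illustrates the idea under stated assumptions and is not Dynamo's actual Router. Several pieces are hypothetical choices made for illustration: the block size, the chained prefix hashing, the `overlap_weight` parameter, and the cost formula itself (weighted blocks still needing prefill plus blocks held by in-flight requests).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)


@dataclass
class Worker:
    name: str
    active_blocks: int  # KV blocks held by in-flight requests, used as a load proxy
    cached_blocks: set = field(default_factory=set)  # prefix-block hashes resident in KV cache


def prefix_block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block with its predecessor, so each hash depends on the whole prefix."""
    hashes, h = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        h = hash((h, tuple(tokens[start:start + block_size])))
        hashes.append(h)
    return hashes


def overlap_blocks(worker, hashes):
    """Number of leading request blocks already cached on this worker."""
    n = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        n += 1
    return n


def route(workers, tokens, overlap_weight=1.0):
    """Pick the worker minimizing weighted new-prefill work plus current load (hypothetical cost)."""
    hashes = prefix_block_hashes(tokens)
    request_blocks = -(-len(tokens) // BLOCK_SIZE)  # ceiling division

    def cost(w):
        new_prefill = request_blocks - overlap_blocks(w, hashes)
        return overlap_weight * new_prefill + w.active_blocks

    return min(workers, key=cost)


tokens = list(range(64))  # a 4-block prompt
warm = Worker("warm", active_blocks=2, cached_blocks=set(prefix_block_hashes(tokens)))
cold = Worker("cold", active_blocks=0)
print(route([warm, cold], tokens).name)  # warm: cost 0 + 2 = 2 vs. 4 + 0 = 4
```

Raising `active_blocks` on the warm worker above the cold worker's prefill cost flips the decision. A simulator can expose exactly that tipping point when comparing cost functions.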

Method

DynoSim runs a discrete-event simulation on a virtual clock. Components such as load generators, routers, schedulers, and the KV cache schedule future events, and the simulator advances directly from one event to the next. It records request-level and system-level metrics from the simulated timeline.
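The method above can be illustrated with a minimal discrete-event core in Python. This is a sketch, not DynoSim's Rust implementation. The `Simulator` and `Engine` classes are hypothetical, and the engine assumes fixed prefill and decode costs with one request at a time and no batching. DynoSim, by contrast, uses measured forward-pass timing and real scheduler cores. The sketch does show the core mechanic: events sit in a priority queue keyed by virtual time, the clock jumps from one event to the next, and request-level metrics such as TTFT fall out of the simulated timeline.

```python
import heapq
import itertools
from collections import deque


class Simulator:
    """Minimal discrete-event core: a virtual clock plus a priority queue of future events."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker: same-time events run in scheduling order

    def schedule(self, delay, callback, *args):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), callback, args))

    def run(self, until=float("inf")):
        while self._queue and self._queue[0][0] <= until:
            time, _, callback, args = heapq.heappop(self._queue)
            self.now = time  # jump straight to the next event; nothing waits on the wall clock
            callback(*args)


class Engine:
    """Hypothetical single engine serving one request at a time with fixed costs (no batching)."""

    def __init__(self, sim, prefill_s, decode_s):
        self.sim, self.prefill_s, self.decode_s = sim, prefill_s, decode_s
        self.waiting = deque()  # FIFO of (req_id, arrival_time)
        self.busy = False
        self.ttft = {}  # req_id -> time to first token, in simulated seconds

    def arrive(self, req_id):
        self.waiting.append((req_id, self.sim.now))
        self._try_start()

    def _try_start(self):
        if self.busy or not self.waiting:
            return
        self.busy = True
        req_id, arrived = self.waiting.popleft()
        self.sim.schedule(self.prefill_s, self._first_token, req_id, arrived)

    def _first_token(self, req_id, arrived):
        self.ttft[req_id] = self.sim.now - arrived  # request-level metric
        self.sim.schedule(self.decode_s, self._finish)

    def _finish(self):
        self.busy = False
        self._try_start()


# Load generator: three requests 0.5 s apart (hypothetical trace and timings).
sim = Simulator()
engine = Engine(sim, prefill_s=0.2, decode_s=1.0)
for i in range(3):
    sim.schedule(0.5 * i, engine.arrive, i)
sim.run()
print({k: round(v, 3) for k, v in engine.ttft.items()})  # {0: 0.2, 1: 0.9, 2: 1.6}
```

Because the clock jumps between events instead of sleeping, simulated time is decoupled from wall-clock time. Queueing delay shows up directly in TTFT: later requests wait behind earlier decodes.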

In practice

Map Pareto frontiers for LLM serving workloads on existing hardware.
Optimize autoscaling intervals (e.g., 5-10 seconds) for responsiveness.
Quantify cold-start time impact on Service Level Agreement (SLA) adherence.
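Once each candidate configuration has been simulated, mapping a Pareto frontier comes down to filtering out dominated configurations. The sketch below assumes two metrics per candidate: throughput, where higher is better, and latency, where lower is better. It uses made-up numbers in place of real DynoSim output. The `Candidate` type and `pareto_frontier` function are illustrative and not part of DynoSim.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    throughput: float  # e.g. output tokens/s per GPU; higher is better
    latency: float     # e.g. p90 TTFT in seconds; lower is better


def pareto_frontier(candidates):
    """Keep candidates that no other candidate beats on both throughput and latency."""
    # Sort by latency, breaking ties by higher throughput first.
    ordered = sorted(candidates, key=lambda c: (c.latency, -c.throughput))
    frontier, best_throughput = [], float("-inf")
    for c in ordered:
        # Keep it only if it strictly beats every candidate with lower-or-equal latency.
        if c.throughput > best_throughput:
            frontier.append(c)
            best_throughput = c.throughput
    return frontier


# Hypothetical per-configuration results, standing in for simulator output.
results = [
    Candidate("A", 100.0, 1.0),
    Candidate("B", 150.0, 2.0),
    Candidate("C", 120.0, 2.5),  # dominated by B
    Candidate("D", 200.0, 4.0),
    Candidate("E", 90.0, 0.5),
]
print([c.name for c in pareto_frontier(results)])  # ['E', 'A', 'B', 'D']
```

Sorting by latency first turns frontier extraction into a single O(n log n) sweep. Comparing every pair of candidates would cost O(n²), which matters when screening thousands of configurations.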

Topics

LLM Serving
Discrete-Event Simulation
NVIDIA Dynamo
Performance Optimization
Autoscaling
KV Cache
Router Policy

Best for: Machine Learning Engineer, NLP Engineer, MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.