DataCenterGym: A Physics-Grounded Simulator for Multi-Objective Data Center Scheduling

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DataCenterGym is a physics-grounded simulation environment for job scheduling in geo-distributed data centers. It models the coupling between compute utilization, heat generation, cooling demand, and energy consumption. The reusable testbed integrates compute queueing, building thermal dynamics, localized HVAC behavior, and temperature-dependent service degradation behind a Gymnasium-compatible interface, so scheduling policies can be evaluated under realistic thermal and power constraints using production-scale workload traces such as the Alibaba 2018 cluster trace. The authors also present a Hierarchical Model Predictive Control (H-MPC) algorithm that performs distributed job placement while explicitly accounting for thermal and power dynamics. In experiments covering nominal operation and workload sensitivity, H-MPC improves scheduling performance over baseline schedulers.
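To make the coupling between queueing, heat, cooling, and service degradation concrete, here is a minimal single-site sketch that follows the Gymnasium `reset`/`step` return convention without importing the library. It is not DataCenterGym's code or API: the class name `ToyDataCenterEnv`, the first-order thermal update, the reward, and every constant are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field
import random


@dataclass
class ToyDataCenterEnv:
    """Toy single-site environment (illustrative; not the DataCenterGym API).

    Every constant below is an assumption chosen for readability.
    """
    capacity: float = 10.0        # jobs served per step when the site is cool
    ambient_c: float = 25.0       # ambient temperature (deg C)
    throttle_c: float = 32.0      # above this, the service rate degrades
    heat_per_job: float = 0.4     # deg C added per served job
    leak: float = 0.1             # passive pull toward ambient per step
    cool_gain: float = 0.5        # HVAC pull toward the setpoint per step
    cool_kw_per_deg: float = 2.0  # cooling power per degree removed
    it_kw_per_job: float = 1.0    # IT power per served job
    seed: int = 0
    queue: float = field(default=0.0, init=False)
    temp_c: float = field(default=25.0, init=False)

    def reset(self):
        """Start an episode; must be called before step()."""
        self.rng = random.Random(self.seed)
        self.queue, self.temp_c = 0.0, self.ambient_c
        return (self.queue, self.temp_c), {}

    def step(self, action):
        admit_frac, setpoint_c = action
        arrivals = self.rng.randint(4, 12)  # stochastic workload arrivals
        self.queue += admit_frac * arrivals

        # Temperature-dependent service degradation (floored at 20% capacity).
        over = max(0.0, self.temp_c - self.throttle_c)
        rate = self.capacity * max(0.2, 1.0 - 0.1 * over)
        served = min(self.queue, rate)
        self.queue -= served

        # First-order thermal update: job heat in, leakage and HVAC out.
        removed = self.cool_gain * max(0.0, self.temp_c - setpoint_c)
        self.temp_c += (self.heat_per_job * served
                        - self.leak * (self.temp_c - self.ambient_c)
                        - removed)

        energy_kw = self.it_kw_per_job * served + self.cool_kw_per_deg * removed
        reward = served - 0.05 * energy_kw - 0.1 * self.queue
        info = {"served": served, "energy_kw": energy_kw}
        # Gymnasium-style tuple: obs, reward, terminated, truncated, info
        return (self.queue, self.temp_c), reward, False, False, info


env = ToyDataCenterEnv(seed=0)
obs, _ = env.reset()
for _ in range(5):
    # Fixed policy: admit everything, hold a 24 deg C setpoint.
    obs, reward, terminated, truncated, info = env.step((1.0, 24.0))
```

The action here is a hybrid of an admission fraction and a thermal setpoint, which mirrors the kind of decision the summary describes; a real multi-site simulator would add per-site state, trace-driven arrivals, and a richer HVAC model.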

Key takeaway

Research scientists developing data center scheduling algorithms can use DataCenterGym to evaluate policies under realistic, coupled thermal and power dynamics. The framework supports principled assessment of trade-offs among throughput, latency, thermal safety, and energy efficiency without access to production infrastructure. Hierarchical control strategies such as H-MPC merit particular attention: they manage these interactions through anticipatory thermal control, which expands the system's safe operating envelope as load increases.

Key insights

DataCenterGym simulates geo-distributed data center scheduling, integrating thermal, power, and workload dynamics for comprehensive policy evaluation.

Method

DataCenterGym models online job allocation as a discrete-time stochastic environment. H-MPC decomposes control into two levels: a datacenter-level supervisory MPC that sets admission and thermal setpoints, and a cluster-level scheduling MPC that allocates jobs. The decomposition is meant to handle the hybrid action space and to let the controller scale.
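The two-level split can be sketched as follows. This is not the authors' H-MPC: the upper level is a brute-force receding-horizon search over a small action grid using an assumed nominal model, and the lower level is a greedy stand-in for the cluster-level MPC. The names `predict_cost`, `supervisory_mpc`, `cluster_scheduler`, and `PARAMS`, along with all weights and constants, are hypothetical.

```python
from itertools import product

# Assumed nominal model parameters and cost weights (illustrative only).
PARAMS = {
    "capacity": 10.0, "ambient_c": 25.0, "limit_c": 32.0,
    "heat_per_job": 0.4, "leak": 0.1, "cool_gain": 0.5,
    "w_queue": 0.1, "w_energy": 0.05, "w_thermal": 1.0, "w_reject": 0.5,
}


def predict_cost(temp_c, queue, admit, setpoint_c, arrivals, horizon, p):
    """Roll a deterministic model forward, holding the action constant."""
    cost = 0.0
    for _ in range(horizon):
        queue += admit * arrivals
        served = min(queue, p["capacity"])
        queue -= served
        removed = p["cool_gain"] * max(0.0, temp_c - setpoint_c)
        temp_c += (p["heat_per_job"] * served
                   - p["leak"] * (temp_c - p["ambient_c"])
                   - removed)
        over = max(0.0, temp_c - p["limit_c"])
        cost += (p["w_queue"] * queue            # backlog penalty
                 + p["w_energy"] * removed       # cooling effort
                 + p["w_thermal"] * over ** 2    # soft thermal limit
                 + p["w_reject"] * (1.0 - admit) * arrivals)  # rejected work
    return cost


def supervisory_mpc(temp_c, queue, arrivals, p=PARAMS, horizon=5):
    """Upper level: pick (admission fraction, setpoint) with lowest cost."""
    candidates = product((0.5, 0.75, 1.0), (20.0, 24.0, 28.0))
    return min(
        candidates,
        key=lambda c: predict_cost(temp_c, queue, c[0], c[1],
                                   arrivals, horizon, p),
    )


def cluster_scheduler(job_sizes, cluster_free):
    """Lower level (greedy stand-in): largest job to the freest cluster."""
    free = list(cluster_free)
    placement = []
    for size in sorted(job_sizes, reverse=True):
        k = max(range(len(free)), key=free.__getitem__)
        if free[k] < size:
            placement.append((size, None))  # no cluster fits; job waits
            continue
        free[k] -= size
        placement.append((size, k))
    return placement


# One control step: the supervisor sets admission and setpoint from a
# forecast, then the scheduler places the admitted jobs on clusters.
admit, setpoint_c = supervisory_mpc(temp_c=30.0, queue=4.0, arrivals=8.0)
placement = cluster_scheduler([3.0, 1.0, 2.0], [4.0, 3.0])
```

Only the first action of each plan would be applied before re-planning at the next step, which is what gives a receding-horizon controller its anticipatory behavior. Keeping the continuous setpoint decision in the upper level and the discrete placement in the lower level is one way to separate the two parts of a hybrid action space.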

Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.