Scheduling in a changing world: Maximizing throughput with time-varying capacity
Summary
Google Research introduced new, provably effective algorithms for non-preemptive job scheduling on cloud infrastructure with time-varying machine availability, a common scenario in large-scale cloud computing. The research, presented at SPAA 2025, addresses the problem of maximizing throughput (total weight or number of successful jobs) when resources fluctuate due to hardware failure, maintenance, or high-priority tasks claiming capacity. For the offline setting, where future capacity is known, a Greedy strategy achieved a 1/2-approximation for unit profit jobs and a 1/4-approximation using a primal-dual framework for varying profits. In the more complex online setting, where jobs arrive dynamically, standard non-preemptive algorithms fail. The team developed a variant of Greedy for "interruption with restarts" that also achieves a 1/2-competitive ratio. For the stricter "interruption without restarts" model, a novel algorithm achieved a 1/11 competitive ratio for common deadline instances, representing the first constant competitive ratio for this problem.
Key takeaway
For AI Scientists designing or optimizing cloud resource schedulers, this research highlights the necessity of moving beyond static capacity assumptions. You should evaluate your current scheduling algorithms against the challenges of time-varying capacity, particularly for non-preemptive batch jobs. Consider implementing algorithms that allow for job restarts to significantly improve throughput in dynamic online environments, or explore the proposed constant competitive ratio algorithms for stricter "interruption without restarts" scenarios, especially when jobs share common deadlines.
Key insights
New algorithms optimize non-preemptive job scheduling in dynamic cloud environments with fluctuating machine capacity.
Principles
- Greedy strategies can achieve strong approximation guarantees.
- Flexibility in job interruption significantly improves online scheduling.
- Constant-factor approximations are crucial for NP-hard problems.
Method
The online algorithm for "interruption without restarts" maintains a tentative schedule, modifying it by adding, replacing, interrupting, or discarding jobs based on arrival and current state, particularly for common deadline scenarios.
In practice
- Prioritize jobs with common deadlines for online scheduling.
- Consider job restart flexibility for better throughput.
- Implement Greedy-based heuristics for offline planning.
Topics
- Time-Varying Capacity Scheduling
- Non-Preemptive Job Scheduling
- Throughput Maximization
- Approximation Algorithms
- Online Algorithms
Best for: AI Scientist, AI Researcher, Research Scientist, AI Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.