Take it to the limit: Peak prediction-driven resource overcommitment in datacenters
Presented at EuroSys 2021.
Authors: Noman Bashir, Nan Deng, Krzysztof Rzadca, David Irwin, Sree Kodakara, Rohit Jnagal (University of Massachusetts Amherst & Google)
Code (simulator): https://github.com/googleinterns/cluster-resource-forecast
This paper focuses on the problem of resource overcommitment.
Assuming complete knowledge of each task's future resource usage, what is the safest overcommit policy that yields the highest utilization?
This work formalizes overcommitment as the problem of predicting peak usage on each machine, which is complementary and orthogonal to the scheduling problem.
They implement a peak oracle in simulation (historical traces provide complete knowledge of future usage) and use it to evaluate practical peak predictors.
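A minimal sketch of the oracle idea, assuming the trace gives total usage per machine as a time series (the function name and `horizon` parameter are illustrative, not the simulator's actual API):

```python
import numpy as np

def peak_oracle(machine_usage: np.ndarray, t: int, horizon: int) -> float:
    """True peak: the maximum total usage on one machine over the next
    `horizon` samples, read directly from the historical trace."""
    return float(machine_usage[t : t + horizon].max())
```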
The predictors should be lightweight and fast to compute.
borg-default
Inspired by Borg
peak = a fixed fraction of the sum of task limits (e.g., 90%)
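A minimal sketch, assuming the scheduler knows each task's resource limit; the fraction is a tunable policy knob:

```python
def borg_default(task_limits, fraction=0.9):
    """Predicted peak = a fixed fraction (here 90%) of the sum of
    task limits on the machine."""
    return fraction * sum(task_limits)
```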
RC-like
Inspired by Resource Central
peak = sum over tasks of the x-th percentile of each task's usage
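A sketch of the per-task percentile idea, assuming each task's usage history is available as an array; the percentile value is a tunable parameter:

```python
import numpy as np

def rc_like(task_usage_histories, pct=99):
    """Predicted peak = sum over tasks of the pct-th percentile
    of each task's historical usage."""
    return float(sum(np.percentile(hist, pct) for hist in task_usage_histories))
```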
N-sigma
Based on the central limit theorem
peak = mean + N * STD of total machine usage (based on usage, not limits)
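A sketch assuming a history of total machine usage is available; `n` controls how conservative the prediction is:

```python
import numpy as np

def n_sigma(machine_usage_history, n=5):
    """Predicted peak = mean + n * stddev of total machine usage.
    By the CLT, the sum of many tasks' usage is roughly normal,
    so mean + n*stddev bounds the peak with high probability."""
    return float(np.mean(machine_usage_history) + n * np.std(machine_usage_history))
```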
max(predictors)
peak = max(peaks across predictors)
This is the predictor the paper ultimately adopts; it combines RC-like and N-sigma.
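Combining the sketches above (illustrative; reuses `rc_like` and `n_sigma` from the earlier blocks, with illustrative default parameters):

```python
def max_predictor(task_usage_histories, machine_usage_history, pct=99, n=5):
    """Take the largest, i.e., most conservative, of the predictions."""
    return max(rc_like(task_usage_histories, pct),
               n_sigma(machine_usage_history, n))
```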
Propose a general methodology (peak oracle) for designing and evaluating overcommit policies.
Complementary and orthogonal to the cluster scheduling algorithm.
Oversubscribe serving tasks with other serving tasks, not only with batch workloads.
Demonstrate that the max predictor policy is less risky and more efficient than the individual predictors.