Shepherd: Serving DNNs in the wild
#model_serving_system #mixed-integer_linear_programming #workload_unpredictability
Meta Info
Presented in NSDI 2023.
Authors: Hong Zhang (UWaterloo), Yupeng Tang, Anurag Khandelwal (Yale), Ion Stoica (UC Berkeley)
Understanding the paper
TL;DR
This work presents Shepherd, a model serving system.
It uses a two-level design that decouples model serving into planning and serving modules.
Plan: aggregate request streams into moderately-sized groups.
Serve: employ an online algorithm; leverage preemption and model-specific batching
Challenges
Short-term workload unpredictability
The request arrival rates can be quite unpredictable at smaller time granularities (e.g., milliseconds).
Resource utilization vs. scalability
(1) Make periodic provisioning and serving decisions for each user stream independently → Over-provisioning GPUs → Poor resource utilization
(2) Time-multiplex the GPU cluster across different user streams → Increased computational complexity when the numbers of request streams and GPUs are large → Poor scalability
Designs
As indicated in Nexus (SOSP 2019), a simple linear model can accurately model the execution latency for varying batch sizes.
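To make this concrete, here is a tiny sketch of fitting such a linear latency model from profiled batch latencies; the numbers and the use of numpy.polyfit are illustrative, not from the paper.

```python
# Illustration of the Nexus-style observation that batch execution latency
# grows roughly linearly with batch size: profile a few batch sizes and fit
# lat(b) ~= alpha + beta * b. The measurements below are made up.
import numpy as np

batch_sizes = np.array([1, 2, 4, 8, 16, 32])
measured_ms = np.array([3.1, 3.9, 5.6, 9.0, 15.8, 29.5])   # hypothetical profile

beta, alpha = np.polyfit(batch_sizes, measured_ms, deg=1)   # slope, then intercept
print(f"lat(b) ~= {alpha:.2f} ms + {beta:.2f} ms * b")
```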
Periodic planner (Herd)
Use an Integer Linear Program (ILP) to combine request streams into serving groups so as to maximize the minimum burst tolerance across all streams (a simplified sketch follows the constraint list below).
Decision variables
Input parameters
Optimization Objective
Constraints
Cluster-size limit
Group-worker limit
GPU memory limit
Affinity-set surjectivity
No per-stream SLO constraint.
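A minimal sketch of the grouping optimization, assuming a constant per-GPU capacity and one model per stream; the max-min objective is handled here by binary-searching a target burst tolerance and solving a feasibility ILP with PuLP. This illustrates the idea, not Herd's exact formulation; in particular, affinity-set surjectivity is reduced to a one-group-per-stream constraint, and all names, rates, and limits are hypothetical.

```python
# Illustrative grouping ILP (not Herd's exact formulation): assign streams to
# serving groups and GPUs to groups so the minimum burst tolerance
# (provisioned capacity / expected load) across streams is maximized.
import pulp

streams = {                  # stream -> (expected rate in req/s, model memory in GB)
    "s0": (200.0, 2.0),
    "s1": (120.0, 4.0),
    "s2": (300.0, 1.5),
}
groups = ["g0", "g1", "g2"]
PER_GPU_CAPACITY = 400.0     # req/s one GPU can sustain (assumed constant)
TOTAL_GPUS = 8               # cluster-size limit
MAX_GPUS_PER_GROUP = 4       # group-worker limit
GPU_MEMORY_GB = 16.0         # GPU memory limit

def feasible(t: float) -> bool:
    prob = pulp.LpProblem("herd_feasibility", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (list(streams), groups), cat="Binary")   # stream -> group
    n = pulp.LpVariable.dicts("n", groups, lowBound=0,
                              upBound=MAX_GPUS_PER_GROUP, cat="Integer")    # GPUs per group
    prob += pulp.lpSum(n[g] for g in groups)        # benign objective: use few GPUs

    for s in streams:                               # each stream served by exactly one group
        prob += pulp.lpSum(x[s][g] for g in groups) == 1
    prob += pulp.lpSum(n[g] for g in groups) <= TOTAL_GPUS          # cluster-size limit
    for g in groups:
        # GPU memory limit: models co-located in a group must fit together.
        prob += pulp.lpSum(streams[s][1] * x[s][g] for s in streams) <= GPU_MEMORY_GB
        # Burst tolerance: group capacity must cover t times its aggregate load.
        # (t is a constant inside this subproblem, so the constraint stays linear.)
        prob += PER_GPU_CAPACITY * n[g] >= t * pulp.lpSum(
            streams[s][0] * x[s][g] for s in streams)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.LpStatus[prob.status] == "Optimal"

# Binary search for the largest achievable minimum burst tolerance.
lo, hi = 0.0, 20.0
for _ in range(20):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
print(f"max-min burst tolerance ~= {lo:.2f}")
```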
Online serving algorithm (Flex)
Objective: maximize the overall goodput.
Choose the largest feasible batch across all model priority queues (each sorted by deadline); see the sketch below.
Preemption
Insert exit points between different DNN layers
Trade off the preemption overhead against the execution delay overhead introduced by the exit points
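A simplified sketch of the deadline-ordered batch selection, reusing the linear latency model above; this is not the paper's exact Flex algorithm, the preemption decision is only indicated in a comment, and all names and parameters are hypothetical.

```python
# Deadline-aware batching in the spirit of Flex: each model keeps a priority
# queue of requests sorted by deadline, and the scheduler picks, across all
# models, the largest batch of earliest-deadline requests that can still finish
# before its tightest deadline, using lat(b) = alpha + beta * b.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                  # absolute SLO deadline (seconds)
    rid: int = field(compare=False)

class ModelQueue:
    def __init__(self, name, alpha, beta, max_batch):
        self.name = name
        self.alpha, self.beta = alpha, beta   # lat(b) = alpha + beta * b (seconds)
        self.max_batch = max_batch
        self.heap = []                        # min-heap ordered by deadline

    def push(self, req):
        heapq.heappush(self.heap, req)

    def largest_feasible_batch(self, now):
        # The tightest deadline in a batch of earliest-deadline requests is
        # heap[0], so a batch of size b is feasible iff
        # alpha + beta * b <= heap[0].deadline - now.
        if not self.heap:
            return 0
        slack = self.heap[0].deadline - now
        if slack < self.alpha + self.beta:
            return 0                          # even a batch of 1 would miss its SLO
        return min(int((slack - self.alpha) / self.beta), self.max_batch, len(self.heap))

def schedule_next(queues, now):
    """Pick the model with the largest feasible batch and dequeue that batch.
    Flex additionally weighs this candidate against preempting the batch that
    is currently executing (not modeled here)."""
    best = max(queues, key=lambda q: q.largest_feasible_batch(now), default=None)
    if best is None:
        return None, []
    b = best.largest_feasible_batch(now)
    if b == 0:
        return None, []
    return best.name, [heapq.heappop(best.heap) for _ in range(b)]
```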
Evaluation
Baselines
Clockwork (OSDI 2020)
Nexus (SOSP 2019)
Setup
Testbed: 12 p3.2xlarge instances (8 vCPUs, 61GB RAM, 1 V100 GPU with 16GB memory).
Emulation: m4.16xlarge instances (64 vCPUs, 256GB RAM).
The request router, periodic planner, and online schedulers are deployed on separate m4.16xlarge instances.
Comments
Preemption
If preemption occurs, requests in preempted batch that can still meet their SLOs are re-enqueued to their corresponding priority queues. The re-enqueued requests will be treated as newly arrived requests so they can be scheduled again.
A preempted batch doesn't contribute to system throughput, so the GPU time already spent on it is wasted.
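Continuing the Flex sketch above, a hypothetical re-enqueue helper that keeps only the preempted requests that can still meet their SLOs:

```python
# After a preemption, requests from the preempted batch that can still meet
# their SLOs are pushed back onto their model's queue and compete again like
# newly arrived requests; the rest are dropped. The feasibility check
# (alpha + beta, i.e., a batch of one) is an illustrative criterion.
def requeue_preempted(queue, preempted_batch, now):
    for req in preempted_batch:
        if now + queue.alpha + queue.beta <= req.deadline:   # could still finish in time
            queue.push(req)
        # otherwise the request has already missed its SLO and is dropped
```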
Dynamic model swapping
Shepherd only loads models onto GPU memory at the start of a planning period. An alternative solution is to dynamically swap models between GPU and CPU memory on-demand during online serving. However, since such swaps are likely to take much longer than serving a request, its cost must be weighed against the potential performance gains from swapping in a new model. We leave incorporating this decision as a part of online serving as future work.
It doesn't support dynamically swapping models between CPU and GPU memory during online serving.
Support large DNN models
If a DNN model is so large that it cannot be co-located with other models in GPU memory, Herd must place it in an isolated group with reduced degree of multiplexing. It is possible, however, to break such large models into smaller partitions to group them with other models for better multiplexing.
AlpaServe (OSDI 2023) seems to adopt this way by partitioning large models.