Shepherd: Serving DNNs in the wild

#model_serving_system #mixed-integer_linear_programming #workload_unpredictability

Meta Info

Presented at NSDI 2023.

Authors: Hong Zhang (UWaterloo), Yupeng Tang, Anurag Khandelwal (Yale), Ion Stoica (UC Berkeley)

Understanding the paper

TL;DR

  • This work presents Shepherd, a model serving system.

  • It uses a two-level design that decouples model serving into planning and serving modules.

    • Plan: aggregate request streams into moderately-sized groups.

    • Serve: employ an online algorithm; leverage preemption and model-specific batching

Challenges

  • Short-term workload unpredictability

    • The request arrival rates can be quite unpredictable at smaller time granularities (e.g., milliseconds).

  • Resource utilization vs. scalability

    • (1) Make periodic provisioning and serving decisions for each user stream independently → Over-provisioning GPUs → Poor resource utilization

    • (2) Time-multiplex the GPU cluster across different user streams → Increased computational complexity when the numbers of request streams and GPUs are large → Poor scalability

Designs

  • As indicated in Nexus (SOSP 2019), a simple linear model can accurately model the execution latency for varying batch sizes.

    • $l_m = \alpha_m \cdot |B| + \beta_m$ (a small fitting sketch follows this list)

    • $|B|$: the batch size.

    • $\alpha_m$: the latency for each additional request in the batch.

    • $\beta_m$: the baseline execution latency for executing an empty batch on model $m$.
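The per-model coefficients $\alpha_m$ and $\beta_m$ can be obtained by profiling a handful of batch sizes and fitting a line. A minimal sketch with made-up measurements (the numbers and the `predict_latency` helper are illustrative, not from the paper):

```python
# Fit the linear latency profile l_m = alpha_m * |B| + beta_m from profiled data.
import numpy as np

batch_sizes = np.array([1, 2, 4, 8, 16, 32])                 # profiled batch sizes |B|
latencies_ms = np.array([4.1, 5.0, 7.2, 11.3, 19.8, 36.5])   # measured latencies (made up)

alpha_m, beta_m = np.polyfit(batch_sizes, latencies_ms, deg=1)  # slope, intercept

def predict_latency(batch_size: int) -> float:
    """Estimated execution latency (ms) for a batch of the given size."""
    return alpha_m * batch_size + beta_m

print(f"alpha = {alpha_m:.2f} ms/request, beta = {beta_m:.2f} ms")
print(f"predicted latency for |B| = 24: {predict_latency(24):.1f} ms")
```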

Periodic planner (Herd)

  1. Each request stream $i$ has an average load ${rate}_i$.

  2. Measure the maximum goodput $T_i$ that each stream $i$ can achieve on a single GPU.

  3. Compute $n_i = \frac{{rate}_i}{T_i}$, i.e., the number of GPUs stream $i$ needs to serve its average load.


Use an Integer Linear Program (ILP) to combine streams into serving groups so as to maximize the minimum burst tolerance across all streams (a PuLP sketch of the formulation follows the constraint list below).

Burst tolerance metric: $bt(i) = \sum_{j}\frac{size_j \cdot x_{ij}}{n_i}$
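A tiny numeric illustration of the metric (all values are made up; $x_{ij}$ and $size_j$ are the decision variables defined below). Since each stream is mapped to exactly one group, $bt(i)$ reduces to the GPU count of the stream's group divided by $n_i$:

```python
# Burst tolerance for a hand-picked assignment (illustrative numbers only).
n = {"stream_a": 1.5, "stream_b": 0.5, "stream_c": 2.0}              # n_i = rate_i / T_i
group_size = {"g0": 4, "g1": 4}                                      # size_j (GPUs per group)
assignment = {"stream_a": "g0", "stream_b": "g0", "stream_c": "g1"}  # x_ij as a mapping

def burst_tolerance(stream: str) -> float:
    """bt(i) = sum_j size_j * x_ij / n_i; only the stream's own group contributes."""
    return group_size[assignment[stream]] / n[stream]

for s in n:
    print(s, round(burst_tolerance(s), 2))
print("objective (min over streams):", min(burst_tolerance(s) for s in n))
```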

  • Decision variables

    • $x_{ij} \in \{0,1\}$: Is stream $i$ mapped to group $j$?

    • $y_{cj} \in \{0,1\}$: Is affinity-set $c$ mapped to group $j$?

    • $z_{kj} \in \{0,1\}$: Is model $k$ mapped to group $j$?

    • $size_j$: # of GPUs allocated to group $j$

  • Input parameters

    • $mem$: GPU memory capacity

    • $m_k$: memory footprint of model $k$ (used in the GPU memory constraint below)

    • $G$: Scalability limit for # of GPUs per group

    • $N$: # of GPUs in the cluster

    • $h_{ki} \in \{0,1\}$: Does stream $i$ use model $k$?

    • $q_{ck} \in \{0,1\}$: Does affinity-set $c$ include model $k$?

  • Optimization Objective

    • $\text{maximize} \ \min_{i} \{bt(i)\}$

  • Constraints

    • Cluster-size limit

      • $\sum_{j} size_j \le N$

    • Group-worker limit

      • $size_j \le G, \ \forall j$

    • GPU memory limit

      • $\sum_{k} z_{kj} \cdot m_k \le mem, \ \forall j$

    • Group surjectivity (every stream $i$ is assigned to a single group $j$, and only if its associated model is also assigned to group $j$)

      • $\sum_{j} x_{ij} = 1, \ \forall i$

      • $h_{ki} \cdot x_{ij} \le z_{kj}, \ \forall i,j,k$

        • Ensures that if stream $i$ is mapped to group $j$ and stream $i$ uses model $k$, then model $k$ must also be mapped to group $j$.

    • Affinity-set surjectivity

      • $\sum_{c} y_{cj} = 1, \ \forall j$

      • $q_{ck} \cdot z_{kj} \le y_{cj}, \ \forall c,j,k$

    • No per-stream SLO constraint.
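A sketch of the planner's ILP in PuLP with toy inputs. The variable and parameter names follow the note; the linearization of the $size_j \cdot x_{ij}$ product via auxiliary variables $w_{ij}$, the max-min auxiliary variable $t$, and all numeric values are my own choices for illustration, not the paper's exact formulation:

```python
# Toy instance of Herd's grouping ILP (maximize the minimum burst tolerance).
import pulp

# Hypothetical inputs.
streams, models, affsets, groups = [0, 1, 2], [0, 1], [0, 1], [0, 1]
N, G, mem = 8, 4, 16                     # cluster size, per-group GPU limit, GPU memory (GB)
n = {0: 1.5, 1: 0.5, 2: 2.0}             # n_i = rate_i / T_i
m_mem = {0: 6, 1: 4}                     # m_k: memory footprint of model k (GB)
h = {(0, 0): 1, (1, 0): 0, (0, 1): 1, (1, 1): 0, (0, 2): 0, (1, 2): 1}  # h[k, i]
q = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}                        # q[c, k]

prob = pulp.LpProblem("herd_planner", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", [(i, j) for i in streams for j in groups], cat="Binary")
y = pulp.LpVariable.dicts("y", [(c, j) for c in affsets for j in groups], cat="Binary")
z = pulp.LpVariable.dicts("z", [(k, j) for k in models for j in groups], cat="Binary")
size = pulp.LpVariable.dicts("size", groups, lowBound=0, upBound=G, cat="Integer")
# w[i, j] linearizes size_j * x_ij so that bt(i) stays linear.
w = pulp.LpVariable.dicts("w", [(i, j) for i in streams for j in groups], lowBound=0)
t = pulp.LpVariable("min_burst_tolerance", lowBound=0)  # auxiliary max-min variable

prob += t                                                 # objective: maximize min_i bt(i)
prob += pulp.lpSum(size[j] for j in groups) <= N          # cluster-size limit (size_j <= G via upBound)

for i in streams:
    prob += n[i] * t <= pulp.lpSum(w[i, j] for j in groups)   # t <= bt(i)
    prob += pulp.lpSum(x[i, j] for j in groups) == 1          # each stream in exactly one group
    for j in groups:
        # Linearize w[i, j] = size_j * x_ij.
        prob += w[i, j] <= G * x[i, j]
        prob += w[i, j] <= size[j]
        prob += w[i, j] >= size[j] - G * (1 - x[i, j])
        for k in models:
            prob += h[k, i] * x[i, j] <= z[k, j]              # stream's model must be in its group

for j in groups:
    prob += pulp.lpSum(z[k, j] * m_mem[k] for k in models) <= mem  # GPU memory limit
    prob += pulp.lpSum(y[c, j] for c in affsets) == 1              # affinity-set surjectivity
    for c in affsets:
        for k in models:
            prob += q[c, k] * z[k, j] <= y[c, j]              # model's affinity-set in the group

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("min burst tolerance:", pulp.value(t))
for j in groups:
    members = [i for i in streams if pulp.value(x[i, j]) > 0.5]
    print(f"group {j}: {int(pulp.value(size[j]))} GPUs, streams {members}")
```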

Online serving algorithm (Flex)

  1. Each request $r$ has an arrival time $a_r$, a deadline $d_r$, and queries model $m_r$.

  2. For a batch $B$:

    1. Its arrival time $a(B)$ is the arrival time of the most recent request in $B$.

    2. Its deadline $d(B)$ is the earliest deadline among all requests in $B$.

Objective: maximize the overall goodput.

  • Choose the largest feasible batch across all per-model priority queues (each sorted by deadline); see the sketch after this list.

  • Preempt the currently running batch if the newly generated batch is $\lambda$ times larger than it.

  • Preemption

    • Insert exit points between different DNN layers

    • Trade off the preemption delay against the execution overhead added by the exit points
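A rough sketch of how Flex's batch construction and preemption check could work, reusing the linear latency model from above. The class and function names (`Request`, `largest_feasible_batch`, `maybe_preempt`), the per-model coefficients, and the value of $\lambda$ are illustrative assumptions, not the paper's code:

```python
# Pick the largest feasible batch per model, and preempt only for a lambda-times-larger batch.
from dataclasses import dataclass, field

LAMBDA = 2.0  # preemption threshold (illustrative)

@dataclass(order=True)
class Request:
    deadline: float                        # d_r
    arrival: float = field(compare=False)  # a_r
    model: str = field(compare=False)      # m_r

def batch_latency(model: str, batch_size: int) -> float:
    """l_m = alpha_m * |B| + beta_m with made-up per-model profiles."""
    alpha, beta = {"resnet": (1.0, 3.0), "bert": (2.5, 8.0)}[model]
    return alpha * batch_size + beta

def largest_feasible_batch(queue: list[Request], model: str, now: float) -> list[Request]:
    """Largest prefix of the deadline-sorted queue that still finishes before d(B)."""
    queue.sort()  # earliest deadline first
    best: list[Request] = []
    for size in range(1, len(queue) + 1):
        candidate = queue[:size]
        d_B = min(r.deadline for r in candidate)      # d(B): earliest deadline in the batch
        if now + batch_latency(model, size) <= d_B:   # feasible: meets every request's deadline
            best = candidate
    return best

def maybe_preempt(running_batch_size: int, queues: dict[str, list[Request]], now: float):
    """Return (model, batch) if a new batch is LAMBDA times larger than the running one."""
    candidates = [(m, largest_feasible_batch(q, m, now)) for m, q in queues.items()]
    model, batch = max(candidates, key=lambda mb: len(mb[1]))
    if batch and len(batch) >= LAMBDA * running_batch_size:
        return model, batch  # preempt; still-feasible requests of the old batch get re-enqueued
    return None
```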

Evaluation

  • Baselines

    • Clockwork (OSDI 2020)

    • Nexus (SOSP 2019)

  • Setup

    • Testbed: 12 p3.2xlarge instances (8 vCPUs, 61GB RAM, 1 V100 GPU with 16GB memory)

    • Emulation: m4.16xlarge instances (64 vCPUs, 256GB RAM)

    • The request router, periodic planner, and online schedulers are deployed on separate m4.16xlarge instances.

Comments

Preemption

If preemption occurs, requests in the preempted batch that can still meet their SLOs are re-enqueued to their corresponding priority queues (sketched below). The re-enqueued requests are treated as newly arrived requests, so they can be scheduled again.

The work already done on a preempted batch doesn't contribute to system throughput, so the GPU time spent on it is wasted.
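A minimal sketch of the re-enqueue step, assuming the `Request`, per-model queues, and `batch_latency` from the Flex sketch above (the optimistic single-request feasibility check is my simplification):

```python
def reenqueue_preempted(batch, queues, now):
    """Put back requests from a preempted batch that can still meet their SLOs."""
    for r in batch:
        # Optimistic check: could the request still finish before its deadline
        # if it were scheduled again right away (as a batch of one)?
        if now + batch_latency(r.model, 1) <= r.deadline:
            queues[r.model].append(r)  # treated like a newly arrived request
```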

Dynamic model swapping

Shepherd only loads models onto GPU memory at the start of a planning period. An alternative solution is to dynamically swap models between GPU and CPU memory on-demand during online serving. However, since such swaps are likely to take much longer than serving a request, its cost must be weighed against the potential performance gains from swapping in a new model. We leave incorporating this decision as a part of online serving as future work.

It doesn't support dynamically swapping models between CPU and GPU memory.

Support large DNN models

If a DNN model is so large that it cannot be co-located with other models in GPU memory, Herd must place it in an isolated group with a reduced degree of multiplexing. It is possible, however, to break such large models into smaller partitions to group them with other models for better multiplexing.

AlpaServe (OSDI 2023) seems to adopt this approach by partitioning large models.
