Shepherd: Serving DNNs in the wild

#model_serving_system #mixed-integer_linear_programming #workload_unpredictability

Meta Info

Presented in NSDI 2023.

Authors: Hong Zhang (UWaterloo), Yupeng Tang, Anurag Khandelwal (Yale), Ion Stoica (UC Berkeley)

Understanding the paper

TL;DR

  • This work presents Shepherd, a model serving system.

  • It uses a two-level design that decouples model serving into planning and serving modules.

    • Plan: aggregate request streams into moderately-sized groups.

    • Serve: employ an online scheduling algorithm that leverages preemption and model-specific batching.

Challenges

  • Short-term workload unpredictability

    • The request arrival rates can be quite unpredictable at smaller time granularities (e.g., milliseconds).

  • Resource utilization vs. scalability

    • (1) Make periodic provisioning and serving decisions for each user stream independently → Over-provisioning GPUs → Poor resource utilization

    • (2) Time-multiplex the GPU cluster across different user streams → Increased computational complexity when the numbers of request streams and GPUs are large → Poor scalability

Designs

  • As shown in Nexus (SOSP 2019), a simple linear model accurately captures a DNN's execution latency as a function of batch size.
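
For intuition, a minimal sketch of such a linear latency model (the coefficients here are made-up placeholders, not profiled numbers from the paper):

```python
# Toy linear batch-latency model: latency(b) ≈ alpha + beta * b.
# In Nexus/Shepherd, alpha (fixed per-batch overhead) and beta (per-request
# cost) come from offline profiling of each model; the numbers below are
# made up for illustration only.

def batch_latency_ms(batch_size: int, alpha_ms: float = 2.0, beta_ms: float = 1.5) -> float:
    """Estimated execution latency for a batch of `batch_size` requests."""
    return alpha_ms + beta_ms * batch_size

def max_feasible_batch(slack_ms: float, alpha_ms: float = 2.0, beta_ms: float = 1.5) -> int:
    """Largest batch whose estimated latency still fits within `slack_ms`."""
    return max(0, int((slack_ms - alpha_ms) // beta_ms))

print(batch_latency_ms(8))         # 2.0 + 1.5 * 8 = 14.0 ms
print(max_feasible_batch(20.0))    # (20 - 2) // 1.5 = 12
```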

Periodic planner (Herd)


Uses an Integer Linear Program (ILP) to combine streams into serving groups so as to maximize the minimum burst tolerance across all streams; a toy sketch follows the list below.

  • Decision variables

  • Input parameters

  • Optimization Objective

  • Constraints

    • Cluster-size limit

    • Group-worker limit

    • GPU memory limit

    • Affinity-set surjectivity

    • No per-stream SLO constraint.
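
A hedged, toy illustration of the grouping objective: assign streams and GPUs to groups so that the minimum burst tolerance is maximized. For self-containment it uses brute-force enumeration instead of an ILP solver, a simplified burst-tolerance proxy, and made-up rates/capacities; the actual ILP also encodes the group-worker, GPU-memory, and affinity-set constraints listed above.

```python
# Toy brute-force version of Herd's grouping objective: assign request streams
# to serving groups and GPUs to groups so that the minimum burst tolerance
# across all streams is maximized.  Burst tolerance is approximated here as
# (group capacity) / (group load); all numbers are hypothetical.

from itertools import product

STREAM_RATES = {"s0": 40.0, "s1": 25.0, "s2": 10.0}   # requests/s, made up
PER_GPU_CAPACITY = 50.0                               # requests/s per GPU, made up
NUM_GPUS = 4
NUM_GROUPS = 2

def min_burst_tolerance(assignment, gpus_per_group):
    """Smallest capacity/load ratio over all non-empty groups."""
    tol = float("inf")
    for g in range(NUM_GROUPS):
        load = sum(rate for s, rate in STREAM_RATES.items() if assignment[s] == g)
        if load == 0:
            continue
        tol = min(tol, gpus_per_group[g] * PER_GPU_CAPACITY / load)
    return tol

streams = list(STREAM_RATES)
best = None
for groups in product(range(NUM_GROUPS), repeat=len(streams)):      # stream -> group
    assignment = dict(zip(streams, groups))
    for gpus in product(range(NUM_GPUS + 1), repeat=NUM_GROUPS):     # GPUs per group
        if sum(gpus) > NUM_GPUS:                                     # cluster-size limit
            continue
        tol = min_burst_tolerance(assignment, gpus)
        if best is None or tol > best[0]:
            best = (tol, assignment, gpus)

print("max-min burst tolerance:", best)
```

Maximizing the minimum burst tolerance is what lets a burst in one stream be absorbed by the headroom provisioned for its whole group, rather than by GPUs over-provisioned per stream.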

Online serving algorithm (Flex)

Objective: maximize the overall goodput.

  • Choose the largest feasible batch across all model priority queues (each sorted by deadline); see the sketch after this list.

  • Preemption

    • Insert exit points between different DNN layers

    • Trade off the preemption overhead against the execution-delay overhead when placing exit points.
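
A hedged sketch of the largest-feasible-batch selection (the helper names, the reuse of the linear latency model, and all numbers are illustrative assumptions, not the paper's code):

```python
# Toy deadline-ordered batch selection: for each model's queue (sorted by
# deadline), find the largest batch that can still finish before the earliest
# deadline in that batch (using the linear latency model), then pick the
# largest such batch across all models.  Illustrative only.

def batch_latency_ms(model, batch_size):
    return model["alpha_ms"] + model["beta_ms"] * batch_size   # linear latency model

def pick_batch(queues, models, now_ms):
    """queues: model name -> list of (deadline_ms, request_id)."""
    best = None                                   # (batch_size, model_name, batch)
    for name, queue in queues.items():
        pending = sorted(queue)                   # earliest deadline first
        for b in range(len(pending), 0, -1):      # try the largest batch first
            batch = pending[:b]
            earliest_deadline_ms = batch[0][0]
            if now_ms + batch_latency_ms(models[name], b) <= earliest_deadline_ms:
                if best is None or b > best[0]:
                    best = (b, name, batch)
                break                             # largest feasible batch for this model
    return best

models = {"resnet50": {"alpha_ms": 2.0, "beta_ms": 1.5}}          # hypothetical profile
queues = {"resnet50": [(40.0, "r1"), (25.0, "r2"), (60.0, "r3")]}  # (deadline_ms, id)
print(pick_batch(queues, models, now_ms=0.0))     # -> (3, 'resnet50', [...])
```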

Evaluation

  • Baselines

    • Clockwork (OSDI 2020)

    • Nexus (SOSP 2019)

  • Setup

    • Testbed: 12 p3.2xlarge instances (8 vCPUs, 61GB RAM, 1 V100 GPU with 16GB memory)

    • Emulation: m4.16xlarge instances (64 vCPUs, 256GB RAM)

    • The request router, periodic planner, and online schedulers are deployed on separate m4.16xlarge instances.

Comments

Preemption

If preemption occurs, requests in the preempted batch that can still meet their SLOs are re-enqueued to their corresponding priority queues. Re-enqueued requests are treated as newly arrived requests, so they can be scheduled again.

The work already done on the preempted batch doesn't contribute to system throughput, so the GPU time it consumed is wasted.
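
A hedged sketch of that re-enqueue step (the feasibility check and all names are illustrative assumptions):

```python
# Toy re-enqueue after preemption: requests from the preempted batch that can
# still meet their SLO if re-executed from scratch go back into their model's
# deadline-ordered queue; the rest are dropped.  `min_exec_ms` stands in for
# an estimated single-request execution latency; all names are illustrative.

def requeue_after_preemption(preempted_batch, queue, now_ms, min_exec_ms):
    """preempted_batch / queue: lists of (deadline_ms, request_id)."""
    dropped = []
    for deadline_ms, req_id in preempted_batch:
        if now_ms + min_exec_ms <= deadline_ms:   # could still finish in time
            queue.append((deadline_ms, req_id))   # treated like a new arrival
        else:
            dropped.append((deadline_ms, req_id))
    queue.sort()                                  # keep earliest-deadline order
    return dropped
```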

Dynamic model swapping

Shepherd only loads models onto GPU memory at the start of a planning period. An alternative is to dynamically swap models between GPU and CPU memory on demand during online serving. However, since such swaps are likely to take much longer than serving a request, their cost must be weighed against the potential performance gains from swapping in a new model. The authors leave incorporating this decision into online serving as future work.

Shepherd doesn't support dynamically swapping models between CPU and GPU memory during online serving.
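
If one did add on-demand swapping, the decision would come down to a cost-benefit check along these lines (a back-of-the-envelope sketch, not anything from the paper; all quantities are hypothetical estimates):

```python
# Hypothetical cost-benefit check for swapping a model into GPU memory
# mid-period: the goodput the model is expected to add over the rest of the
# planning window must outweigh the goodput lost while the GPU is busy
# loading it.  Not part of Shepherd; purely illustrative.

def worth_swapping(extra_goodput_rps, window_remaining_s,
                   swap_time_s, goodput_lost_during_swap_rps):
    gain = extra_goodput_rps * window_remaining_s
    cost = goodput_lost_during_swap_rps * swap_time_s
    return gain > cost

print(worth_swapping(extra_goodput_rps=20, window_remaining_s=30,
                     swap_time_s=5, goodput_lost_during_swap_rps=100))  # True (600 > 500)
```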

Support large DNN models

If a DNN model is so large that it cannot be co-located with other models in GPU memory, Herd must place it in an isolated group with a reduced degree of multiplexing. It is possible, however, to break such a large model into smaller partitions so that it can be grouped with other models for better multiplexing.

AlpaServe (OSDI 2023) seems to adopt this approach by partitioning large models.
