Shepherd: Serving DNNs in the wild

#model_serving_system #mixed-integer_linear_programming #workload_unpredictability

Meta Info

Presented in NSDI 2023.

Authors: Hong Zhang (UWaterloo), Yupeng Tang, Anurag Khandelwal (Yale), Ion Stoica (UC Berkeley)

Understanding the paper

TL;DR

  • This work presents Shepherd, a model serving system.

  • It uses a two-level design that decouples model serving into planning and serving modules.

    • Plan: aggregate request streams into moderately-sized groups.

    • Serve: employ an online algorithm that leverages preemption and model-specific batching.

Challenges

  • Short-term workload unpredictability

    • The request arrival rates can be quite unpredictable at smaller time granularities (e.g., milliseconds).

  • Resource utilization vs. scalability

    • (1) Make periodic provisioning and serving decisions for each user stream independently → Over-provisioning GPUs → Poor resource utilization

    • (2) Time-multiplex the GPU cluster across different user streams → Increased computational complexity when the numbers of request streams and GPUs are large → Poor scalability

Designs

  • As indicated in Nexus (SOSP 2019), a simple linear model can accurately model the execution latency for varying batch sizes (a small fitting sketch follows this list).

    • $l_m = \alpha_m \cdot |B| + \beta_m$

    • $|B|$: the batch size.

    • $\alpha_m$: the latency for each additional request in the batch.

    • $\beta_m$: the baseline execution latency for executing an empty batch on the model.
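
A minimal sketch of fitting this linear model offline with NumPy (the profiled batch sizes and latencies below are made up for illustration):

```python
import numpy as np

# Hypothetical profiled (batch_size, measured_latency_ms) pairs for one model.
profile = np.array([[1, 5.2], [4, 8.9], [8, 14.1], [16, 24.8], [32, 46.0]])

# Fit l_m = alpha_m * |B| + beta_m with a least-squares line.
alpha_m, beta_m = np.polyfit(profile[:, 0], profile[:, 1], deg=1)

def predict_latency_ms(batch_size: int) -> float:
    """Predicted execution latency (ms) for a batch of the given size."""
    return alpha_m * batch_size + beta_m

print(predict_latency_ms(24))  # e.g., estimate latency for a batch of 24 requests
```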

Periodic planner (Herd)

  1. Each request stream $i$ has an average load ${rate}_i$.

  2. Measure the maximum goodput $T_i$ that each stream $i$ can achieve on a single GPU.

  3. Compute $n_i = \frac{{rate}_i}{T_i}$.
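
For illustration (numbers assumed): if stream $i$ arrives at ${rate}_i = 500$ requests/s and a single GPU sustains $T_i = 200$ requests/s for its model, then $n_i = 500 / 200 = 2.5$, i.e., the stream needs roughly 2.5 GPUs' worth of serving capacity.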


Use Integer Linear Programming (ILP) to combine streams into serving groups, maximizing the minimum burst tolerance across all streams (a toy solver sketch follows the constraint list below).

Burst tolerance metric: $bt(i) = \sum_{j}\frac{{size}_j \cdot x_{ij}}{n_i}$

  • Decision variables

    • $x_{ij} \in \{0,1\}$: Is stream $i$ mapped to group $j$?

    • $y_{cj} \in \{0,1\}$: Is affinity-set $c$ mapped to group $j$?

    • $z_{kj} \in \{0,1\}$: Is model $k$ mapped to group $j$?

    • ${size}_j$: # of GPUs allocated to group $j$

  • Input parameters

    • $mem$: GPU memory capacity

    • $G$: scalability limit on the # of GPUs per group

    • $N$: # of GPUs in the cluster

    • $m_k$: GPU memory footprint of model $k$

    • $h_{ki} \in \{0,1\}$: Does stream $i$ use model $k$?

    • $q_{ck} \in \{0,1\}$: Does affinity-set $c$ include model $k$?

  • Optimization Objective

    • $\text{maximize} \ \min_{i}{\{bt(i)\}}$

  • Constraints

    • Cluster-size limit

      • $\sum_{j}{{size}_j} \le N$

    • Group-worker limit

      • ${size}_j \le G, \forall j$

    • GPU memory limit

      • $\sum_{k}{z_{kj} \cdot m_k} \le mem, \forall j$

    • Group surjectivity (every stream $i$ is assigned to exactly one group $j$, and only if its associated model is also assigned to group $j$)

      • $\sum_{j}{x_{ij}} = 1, \forall i$

      • $h_{ki} \cdot x_{ij} \le z_{kj}, \forall i, j, k$

        • Ensure that if stream $i$ is mapped to group $j$ and stream $i$ uses model $k$, then model $k$ must also be mapped to group $j$.

    • Affinity-set surjectivity

      • $\sum_{c}{y_{cj}} = 1, \forall j$

      • $q_{ck} \cdot z_{kj} \le y_{cj}, \forall c, j, k$

    • No per-stream SLO constraint.
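
A toy PuLP sketch of this grouping ILP (every stream, model, affinity-set, and number below is made up). The product ${size}_j \cdot x_{ij}$ inside $bt(i)$ is linearized with an auxiliary variable $w_{ij}$, which is my own simplification rather than necessarily the paper's exact formulation:

```python
# pip install pulp
import pulp

# Toy inputs -- every set and number below is made up for illustration.
streams = ["s0", "s1", "s2"]                 # request streams
models = ["m0", "m1"]                        # DNN models
groups = ["g0", "g1"]                        # candidate serving groups
affinity_sets = ["c0", "c1"]                 # affinity-sets of models

n = {"s0": 2.5, "s1": 1.0, "s2": 0.5}        # n_i = rate_i / T_i
m_size = {"m0": 6, "m1": 8}                  # model memory footprint (GB)
h = {("m0", "s0"): 1, ("m0", "s1"): 1, ("m0", "s2"): 0,
     ("m1", "s0"): 0, ("m1", "s1"): 0, ("m1", "s2"): 1}   # h[k, i]: stream i uses model k
q = {("c0", "m0"): 1, ("c0", "m1"): 0,
     ("c1", "m0"): 0, ("c1", "m1"): 1}                     # q[c, k]: affinity-set c has model k
MEM, G, N = 16, 4, 6                         # GPU memory (GB), per-group GPU cap, cluster size

prob = pulp.LpProblem("herd_group_planning", pulp.LpMaximize)

x = pulp.LpVariable.dicts("x", (streams, groups), cat="Binary")
y = pulp.LpVariable.dicts("y", (affinity_sets, groups), cat="Binary")
z = pulp.LpVariable.dicts("z", (models, groups), cat="Binary")
size = pulp.LpVariable.dicts("size", groups, lowBound=0, upBound=G, cat="Integer")

# w[i][j] linearizes size_j * x_ij so bt(i) stays linear (my simplification).
w = pulp.LpVariable.dicts("w", (streams, groups), lowBound=0, upBound=G)
for i in streams:
    for j in groups:
        prob += w[i][j] <= size[j]
        prob += w[i][j] <= G * x[i][j]
        prob += w[i][j] >= size[j] - G * (1 - x[i][j])

# Max-min objective: maximize t subject to t <= bt(i) for every stream i.
t = pulp.LpVariable("min_burst_tolerance", lowBound=0)
prob += t  # objective: the minimum burst tolerance across streams
for i in streams:
    prob += n[i] * t <= pulp.lpSum(w[i][j] for j in groups)

# Cluster-size limit and per-group GPU memory limit.
prob += pulp.lpSum(size[j] for j in groups) <= N
for j in groups:
    prob += pulp.lpSum(z[k][j] * m_size[k] for k in models) <= MEM

# Group surjectivity: each stream in exactly one group, which must host its model.
for i in streams:
    prob += pulp.lpSum(x[i][j] for j in groups) == 1
    for j in groups:
        for k in models:
            prob += h[(k, i)] * x[i][j] <= z[k][j]

# Affinity-set surjectivity: each group is covered by exactly one affinity-set.
for j in groups:
    prob += pulp.lpSum(y[c][j] for c in affinity_sets) == 1
    for c in affinity_sets:
        for k in models:
            prob += q[(c, k)] * z[k][j] <= y[c][j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for i in streams:
    for j in groups:
        if x[i][j].value() > 0.5:
            print(f"{i} -> {j} (group size = {size[j].value()} GPUs)")
```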

Online serving algorithm (Flex)

  1. Each request $r$ has an arrival time $a_r$, a deadline $d_r$, and queries model $m_r$.

  2. For a batch $B$

    1. Arrival time $a(B)$ is the arrival time of the most recent request in $B$

    2. Deadline $d(B)$ is the earliest deadline of all requests in $B$

Objective: maximize the overall goodput.

  • Choose the largest feasible batch across all model priority queues (sorted by deadlines).

  • Preempt the currently running batch if the newly generated batch is $\lambda$ times larger than it (see the sketches after this list).

  • Preemption

    • Insert exit points between different DNN layers

    • Trade-off the preemption and execution delay overheads
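
A minimal Python sketch of the Flex selection-and-preemption loop under my own simplifying assumptions (a single GPU worker, earliest-deadline-first queues, batches formed as prefixes of a queue, and a made-up $\lambda$); it is not the authors' implementation:

```python
import heapq
import time
from dataclasses import dataclass, field

LAMBDA = 2.0  # preemption threshold (value assumed; the paper tunes this)

@dataclass(order=True)
class Request:
    deadline: float                        # d_r: latest acceptable completion time
    arrival: float = field(compare=False)  # a_r
    model: str = field(compare=False)      # m_r

# One earliest-deadline-first priority queue per model.
queues: dict[str, list[Request]] = {}

def enqueue(req: Request) -> None:
    heapq.heappush(queues.setdefault(req.model, []), req)

def largest_feasible_batch(model: str, now: float, predict_latency_ms) -> list[Request]:
    """Largest EDF-prefix batch whose predicted finish time still meets d(B),
    the earliest deadline among its requests."""
    q = sorted(queues.get(model, []))
    best: list[Request] = []
    for batch_size in range(1, len(q) + 1):
        finish = now + predict_latency_ms(batch_size) / 1000.0
        if finish <= q[0].deadline:        # q[0] has the earliest deadline
            best = q[:batch_size]
    return best

def schedule(now: float, predict_latency_ms, running_batch):
    """One scheduling step: pick the largest feasible batch across all model
    queues and preempt the running batch only if the new one is LAMBDA times larger."""
    candidates = [largest_feasible_batch(m, now, predict_latency_ms) for m in queues]
    candidates = [b for b in candidates if b]
    if not candidates:
        return running_batch
    best = max(candidates, key=len)
    if running_batch is None or len(best) >= LAMBDA * len(running_batch):
        # Remove the chosen requests; requests of a preempted batch that can
        # still meet their SLOs would be re-enqueued here (omitted in this sketch).
        model = best[0].model
        rest = [r for r in queues[model] if all(r is not b for b in best)]
        heapq.heapify(rest)
        queues[model] = rest
        return best
    return running_batch

# Toy usage (values made up): enqueue one request and run one scheduling step.
now = time.time()
enqueue(Request(deadline=now + 0.100, arrival=now, model="resnet50"))
print(len(schedule(now, lambda b: 5.0 + 2.0 * b, running_batch=None)))
```

And a dependency-free sketch of the exit-point idea for preempting a running batch between layers; the `check_every` spacing is a hypothetical knob illustrating the preemption-delay vs. per-check-overhead trade-off:

```python
from typing import Callable, List, Optional

def run_with_exit_points(layers: List[Callable], batch,
                         preempt_requested: Callable[[], bool],
                         check_every: int = 4) -> Optional[object]:
    """Layer-by-layer forward pass with preemption checks at inserted exit points.
    A smaller check_every lowers preemption delay but adds more check overhead."""
    x = batch
    for idx, layer in enumerate(layers):
        x = layer(x)
        if (idx + 1) % check_every == 0 and preempt_requested():
            return None  # drop this batch; the scheduler re-enqueues feasible requests
    return x
```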

Evaluation

  • Baselines

    • Clockwork (OSDI 2020)

    • Nexus (SOSP 2019)

  • Setup

    • Testbed: 12 p3.2xlarge instances (8 vCPUs, 61GB RAM, 1 V100 GPU with 16GB memory)

    • Emulation: m4.16xlarge instances (64 vCPUs, 256GB RAM)

    • The request router, periodic planner, and online schedulers are deployed on separate m4.16xlarge instances.

Comments

Preemption

If preemption occurs, requests in preempted batch that can still meet their SLOs are re-enqueued to their corresponding priority queues. The re-enqueued requests will be treated as newly arrived requests so they can be scheduled again.

The preempted batch doesn't contribute to system throughput, so the GPU cycles it already consumed are wasted.

Dynamic model swapping

Shepherd only loads models onto GPU memory at the start of a planning period. An alternative solution is to dynamically swap models between GPU and CPU memory on-demand during online serving. However, since such swaps are likely to take much longer than serving a request, its cost must be weighed against the potential performance gains from swapping in a new model. We leave incorporating this decision as a part of online serving as future work.

It doesn't support dynamically swapping models between CPU and GPU memory.

Support large DNN models

If a DNN model is so large that it cannot be co-located with other models in GPU memory, Herd must place it in an isolated group with reduced degree of multiplexing. It is possible, however, to break such large models into smaller partitions to group them with other models for better multiplexing.

AlpaServe (OSDI 2023) seems to adopt this approach by partitioning large models.
