SpotServe: Serving generative large language models on preemptible instances

Meta Info

Presented in ASPLOS 2024.

Understanding the paper

TL;DR

  • SpotServe — the first distributed LLM serving system on preemptible instances

  • Techniques

    • Dynamically adapt the LLM parallelization configuration

    • Minimize the cost of migrating instances for dynamic reparallelization

      • Formulated as a bipartite graph matching problem → use the Kuhn-Munkres algorithm to identify an optimal migration plan

    • Stateful inference recovery

      • Commit inference progress at a much finer granularity

      • Resume inference upon preemption

Background

  • Spot instances

    • Lower price than on-demand instances

    • May be preempted at any time

    • Grace period (e.g., 30 seconds for AWS spot instances)
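
As an aside on the grace period above (not something described in these notes), a node-side daemon on AWS EC2 can learn that a spot instance is about to be reclaimed by polling the instance metadata service: the `spot/instance-action` document appears once a preemption is scheduled. The polling interval and surrounding structure below are arbitrary illustrative choices, not SpotServe's actual notification mechanism.

```python
import time

import requests

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def imds_token(ttl_seconds: int = 21600) -> str:
    # IMDSv2: obtain a session token before reading metadata.
    resp = requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
        timeout=2,
    )
    resp.raise_for_status()
    return resp.text


def wait_for_spot_interruption(poll_interval_s: float = 5.0) -> dict:
    """Block until EC2 schedules a preemption, then return the notice."""
    while True:
        headers = {"X-aws-ec2-metadata-token": imds_token()}
        resp = requests.get(f"{IMDS}/meta-data/spot/instance-action",
                            headers=headers, timeout=2)
        if resp.status_code == 200:
            # e.g., {"action": "terminate", "time": "2025-01-01T00:00:00Z"}
            return resp.json()
        time.sleep(poll_interval_s)  # 404 means no interruption scheduled yet


if __name__ == "__main__":
    notice = wait_for_spot_interruption()
    print("Grace period started:", notice)
```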

Existing works

  • Leverage spot instances to reduce the monetary cost of DNN inference

    • Examples: MArk, Cocktail

    • Limitation: Target small DNN models that can fit on a single spot instance with one or multiple GPUs

    • Handle preemptions using request rerouting or redundant computation

Challenges in serving LLMs on spot GPU instances

  • Dynamic reparallelization — how to quickly adapt to changes in spot instances’ availability and requests’ arrival rates?

  • Instance migration — how to minimize the cost of migrating GPU instances for reparallelization?

  • Grace period — how to leverage the grace period to handle unfinished requests?

Designs

  • Inference Server

    • Deployed on a dedicated on-demand CPU instance

    • Three components

      • Request Manager

        • Receive input requests

        • Dynamically partition them into batches

        • Assign these batches to inference instances running on spot GPU instances

        • Collect generated outputs from the inference instances

        • Send the results back to users

      • Meta-context Manager

        • Manage the adjustment of the parallel configuration by sending instructions for context migration to all GPU instances

        • Modules

          • Parallelization Controller

            • Adjust the parallelization configuration to improve LLM serving performance

            • A parallel configuration — $C = (D, P, M, B)$

              • $D$ — data parallelism degree

              • $P$ — pipeline-model parallelism degree

              • $M$ — tensor-model parallelism degree

              • $B$ — the maximum mini-batch size

            • Measure the initialization time in advance

            • Adaptive optimization algorithm

              • Two variables $C_t$ and $N_t$ at time step $t$

                • $C_t$ — the parallel configuration

                • $N_t$ — the number of available instances

                  • Include newly allocated instances

                  • Exclude instances to be preempted

              • Minimize the end-to-end inference latency $l_{req}(C)$ while maintaining a throughput higher than the request arrival rate $\alpha_t$ (a configuration-selection sketch appears at the end of the Designs section)

              • If multiple configurations achieve a similar minimum inference latency → Select the configuration with the lower monetary cost (i.e., using fewer instances)

              • If the peak serving throughput cannot exceed the request arrival rate $\alpha_t$ → Maximize the overall serving throughput instead

              • Optionally allocate on-demand instances to improve serving throughput

              • Runs online with negligible overhead (i.e., less than 1 second)

              • Estimate the latency of different configurations offline in advance

          • Device Mapper

            • Use the Kuhn-Munkres (KM) algorithm to find an optimal device mapping → Maximally reuse the model parameters and KV cache on available GPU instances and minimize the total data transmission (a matching sketch appears at the end of the Designs section)

            • Device mapping — a bipartite graph $\mathcal{G} = (\mathcal{V}_a, \mathcal{V}_t, \varepsilon)$

              • $u \in \mathcal{V}_a$ — a GPU device

              • $v \in \mathcal{V}_t$ — a pipeline-stage-shard position of the parallel configuration

              • A weighted edge $e_{uv}$ — the amount of reusable model parameters and key/value cache when mapping GPU $u$ to position $v$ of the parallel configuration

            • Build a complete bipartite graph and compute the edge weight of every $(u, v)$ pair using the size of their intersection contexts

            • If the new parallel configuration handles fewer concurrent inference requests

              • Discard part of the cached results → avoid exceeding the memory capacity of the new parallel configuration

              • Keep the batches of requests with the most decoding progress

          • Migration Planner

            • Determine the exact migration plan to finish the configuration adjustment

            • Progressive migration schedule — utilize the pipeline structure and prioritize the migration of front model layers’ context

            • Consider the memory usage during the progressive migration process

      • Instance Manager

        • Interacts with the cloud and receives instance preemption/acquisition notifications

        • Allocates on-demand and spot instances at the same time to avoid the waiting overhead when spot-instance allocation fails

        • Prefers to release on-demand instances

        • Keeps a few additional instances (e.g., 2 in the experiments) to dampen the impact of frequent fluctuations in instance availability

  • Inference Engine

    • Deployed on each spot or on-demand GPU instance to serve LLM inference

    • Components

      • Context daemon

        • Manages the model parameters (i.e., model context) and intermediate activations (i.e., cache context) for different requests inside a certain GPU

      • Interruption Arranger

        • Support stateful inference recovery

  • Stateful inference recovery

    • Recover interrupted inference request without recomputation

    • Context daemon maintains the cache context of an inference request

    • Route the request to another inference pipeline using the cached state

    • Just-in-time arrangement

      • Each spot GPU instance includes an interruption arranger that receives a notification when a grace period starts

    • Fault tolerance

      • Delay newly acquired instances from joining so that the arrangements made for prior interruptions remain feasible

      • If an instance is preempted earlier than expected → Give up its cache context and only migrate the model context with the remaining instances

      • If all replicas of the same piece of model context are lost due to unexpected failures → Restart by loading the required model parameters from local storage (e.g., disk) or remote cloud storage (e.g., S3)
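
The Parallelization Controller's decision rule above can be made concrete with a small sketch: among candidate configurations that fit on the available instances and whose profiled peak throughput keeps up with the arrival rate $\alpha_t$, pick the one with the lowest profiled latency $l_{req}(C)$, breaking ties by fewer instances (lower monetary cost); if no configuration keeps up, maximize throughput instead. The `Config` dataclass, the `num_instances` accounting, and the profiled numbers are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    D: int              # data parallelism degree
    P: int              # pipeline-model parallelism degree
    M: int              # tensor-model parallelism degree
    B: int              # maximum mini-batch size
    latency: float      # offline-profiled end-to-end latency l_req(C), seconds
    throughput: float   # offline-profiled peak serving throughput, requests/s

    @property
    def num_instances(self) -> int:
        # Assumes one single-GPU spot instance per pipeline-stage shard
        # in every data-parallel replica (hypothetical accounting).
        return self.D * self.P * self.M


def choose_config(candidates: List[Config], alpha_t: float,
                  n_available: int) -> Optional[Config]:
    """Pick a parallel configuration for time step t (illustrative sketch)."""
    feasible = [c for c in candidates if c.num_instances <= n_available]
    if not feasible:
        return None
    keeps_up = [c for c in feasible if c.throughput >= alpha_t]
    if keeps_up:
        # Minimize latency; among (near-)equal latencies, prefer fewer
        # instances, i.e., lower monetary cost.
        return min(keeps_up, key=lambda c: (c.latency, c.num_instances))
    # No configuration can match the arrival rate: maximize throughput instead.
    return max(feasible, key=lambda c: c.throughput)
```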

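The Device Mapper above reduces context migration to a maximum-weight bipartite matching between surviving GPUs and the pipeline-stage-shard positions of the new configuration. A minimal sketch using SciPy's Hungarian-algorithm solver (`scipy.optimize.linear_sum_assignment`, equivalent to Kuhn-Munkres); the reuse-weight matrix is a made-up toy input standing in for the intersection-context sizes the paper computes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def plan_device_mapping(reuse_bytes: np.ndarray):
    """
    reuse_bytes[u, v]: bytes of model parameters + KV cache already on GPU u
    that would be reusable if u were mapped to pipeline-stage-shard position v.
    Returns (gpu, position) pairs maximizing total reuse, which is equivalent
    to minimizing the total data transmission.
    """
    # linear_sum_assignment minimizes cost, so flip to maximization.
    rows, cols = linear_sum_assignment(reuse_bytes, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example: 3 surviving GPUs, 3 positions in the new configuration.
reuse = np.array([
    [8.0, 0.0, 2.0],   # GPU 0 already holds most of position 0's context
    [0.0, 6.0, 0.0],   # GPU 1 matches position 1
    [1.0, 0.0, 5.0],   # GPU 2 matches position 2
])
print(plan_device_mapping(reuse))  # -> [(0, 0), (1, 1), (2, 2)]
```
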
Implementation

  • Built on top of FasterTransformer

  • 5.6K LoC in C++ and 2.2K LoC in Python

Evaluation

  • Settings

    • Use a real 12-hour availability trace of AWS g4dn spot instances and extract two representative 20-minute segments with different dynamic behaviors

    • Two workloads

      • Stable inference request arrival workload

        • Different request arrival rates for different models

          • 1.5 requests/s for OPT-6.7B

          • 0.35 requests/s for GPT-20B

          • 0.2 requests/s for LLaMA-30B

        • Gamma request arrival process with a coefficient of variation of 6 (a trace-generation sketch appears at the end of the Evaluation section)

      • Fluctuating inference request arrival workload

        • Trace: Serverless-in-the-wild

    • The maximum batch size $B$ is selected from $\{1, 2, 4, 8\}$

    • $S_{in} = 512$ — the sequence length of input tokens

    • $S_{out} = 128$ — the sequence length of output tokens

  • Baselines

    • Rerouting — dynamically reroutes interrupted requests to other available pipelines when preemption happens

    • Reparallelization — restart and reinitialize all instances without context migration

  • Metrics

    • The average and various tail latencies

    • Monetary cost — USD/token
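
For the stable workload above, inter-arrival times follow a Gamma distribution with a given mean rate and coefficient of variation (CV). A small sketch of how such an arrival trace could be generated, using the standard relation CV = 1/sqrt(shape) for a Gamma distribution; the 1.5 requests/s rate and CV of 6 come from the notes, while the function name, request count, and seed are made up for illustration.

```python
import numpy as np


def gamma_arrival_times(rate: float, cv: float, num_requests: int,
                        seed: int = 0) -> np.ndarray:
    """
    Generate request arrival timestamps (seconds) whose inter-arrival times
    follow a Gamma distribution with mean 1/rate and coefficient of
    variation cv. For Gamma(shape, scale): CV = 1/sqrt(shape).
    """
    shape = 1.0 / (cv ** 2)
    scale = 1.0 / (rate * shape)   # mean inter-arrival = shape * scale = 1/rate
    rng = np.random.default_rng(seed)
    inter_arrivals = rng.gamma(shape, scale, size=num_requests)
    return np.cumsum(inter_arrivals)


# e.g., the OPT-6.7B setting in the notes: 1.5 requests/s, CV = 6
ts = gamma_arrival_times(rate=1.5, cv=6, num_requests=1000)
print(f"empirical rate ≈ {len(ts) / ts[-1]:.2f} requests/s")
```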

Limitations and future work

  • Strongly rely on the grace period → Can explore more solutions to improve system performance (e.g., inference workload prediction, instance availability prediction)

  • Focus on single-type GPU instances → Can integrate heterogeneous spot instances or instances from different clouds

  • Take inference latency minimization as the optimization target → Can explore other targets (e.g., strict SLO, high throughput)

  • Can generalize to other preemptible resources (e.g., resource scheduler may preempt resources for urgent jobs with switching overheads)
