SpotServe: Serving generative large language models on preemptible instances

Meta Info

Presented in ASPLOS 2024.

Understanding the paper

TL;DR

  • SpotServe — the first distributed LLM serving system on preemptible instances

  • Techniques

    • Dynamically adapt the LLM parallelization configuration

    • Minimize the cost of migrating instances for dynamic reparallelization

      • Formulated as a bipartite graph matching problem → use the Kuhn-Munkres algorithm to identify an optimal migration plan

    • Stateful inference recovery

      • Commit inference progress at a much finer granularity

      • Resume inference upon preemption

Background

  • Spot instances

    • Lower price than on-demand instances

    • May be preempted at any time

    • Grace period (e.g., 30 seconds for AWS spot instances)

Existing works

  • Leverage spot instances to reduce the monetary cost of DNN inference

    • Example: MArk, Cocktail

    • Limitation: Target small DNN models that can fit on a single spot instance with one or multiple GPUs

    • Handle preemptions using request rerouting or redundant computation

Challenges in serving LLMs on spot GPU instances

  • Dynamic reparallelization — how to quickly adapt to changes in spot instances' availability and requests' arrival rates?

  • Instance migration — how to minimize the cost of migrating GPU instances for reparallelization?

  • Grace period — how to leverage the grace period to handle unfinished requests?

Designs

An overview of SpotServe

  • Inference Server

    • Deployed on a dedicated on-demand CPU instance

    • Three components

      • Request Manager

        • Receive input requests

        • Dynamically partition them into batches

        • Assign these batches to inference instances running on spot GPU instances

        • Collect generated outputs from the inference instances

        • Send the results back to users

      • Meta-context Manager

        • Manage the adjustment of the parallel configuration by sending instructions for context migration to all GPU instances

        • Modules

          • Parallelization Controller

            • Adjust the parallelization configuration to improve LLM serving performance

            • A parallel configuration — C = (D, P, M, B)

              • D — data parallelism degree

              • P — pipeline-model parallelism degree

              • M — tensor-model parallelism degree

              • B — the maximum mini-batch size

            • Measure the initialization time in advance

            • Adaptive optimization algorithm

              • Two variables C_t and N_t at time step t

                • C_t — parallel configuration

                • N_t — the number of available instances

                  • Include newly allocated instances

                  • Exclude instances to be preempted

              • Minimize the end-to-end inference latency l_req(C) while maintaining a throughput higher than the request arrival rate α_t

              • If multiple configurations can achieve similar minimum inference latency → Select the configuration with lower monetary cost (i.e., using fewer instances)

              • If the peak serving throughput cannot exceed the request arrival rate α_t → Maximize the overall serving throughput

              • Optionally allocate on-demand instances to improve serving throughput

              • Run as an online algorithm with negligible overhead (i.e., less than 1 s)

              • Estimate the latencies of different configurations offline in advance (a configuration-selection sketch follows this list)
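A minimal sketch of this configuration search, assuming per-configuration latency and peak throughput have already been profiled offline; `ParallelConfig`, `LATENCY`, `THROUGHPUT`, and `choose_config` are illustrative names rather than SpotServe's actual interface:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ParallelConfig:
    D: int  # data parallelism degree
    P: int  # pipeline-model parallelism degree
    M: int  # tensor-model parallelism degree
    B: int  # maximum mini-batch size

# Hypothetical offline-profiled tables: ParallelConfig -> seconds / requests per second.
LATENCY: dict = {}      # l_req(C), end-to-end inference latency
THROUGHPUT: dict = {}   # peak serving throughput of C

def choose_config(total_gpus: int, arrival_rate: float, batch_sizes=(1, 2, 4, 8)):
    """Pick C_t given the GPUs of the N_t available instances and arrival rate alpha_t."""
    meets_rate, below_rate = [], []
    for D, P, M in product(range(1, total_gpus + 1), repeat=3):
        if D * P * M > total_gpus:
            continue
        for B in batch_sizes:
            cfg = ParallelConfig(D, P, M, B)
            if cfg not in LATENCY:
                continue  # configuration was not profiled offline
            (meets_rate if THROUGHPUT[cfg] >= arrival_rate else below_rate).append(cfg)
    if meets_rate:
        # Minimize latency; rounding approximates "similar latency", and ties
        # are broken by preferring fewer GPUs (lower monetary cost).
        return min(meets_rate, key=lambda c: (round(LATENCY[c], 2), c.D * c.P * c.M))
    if below_rate:
        # No configuration keeps up with alpha_t: maximize overall throughput instead.
        return max(below_rate, key=lambda c: THROUGHPUT[c])
    return None
```

The paper additionally accounts for the initialization/migration time measured in advance when switching configurations; that term is omitted here for brevity.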

          • Device Mapper

            • Use the Kuhn-Munkres (KM) algorithm to find an optimal device mapping → Maximally reuse the model parameters and KV cache on available GPU instances & minimize the total data transmission

            • Device mapping — a bipartite graph G = (V_a, V_t, ε)

              • u ∈ V_a — a GPU device

              • v ∈ V_t — a pipeline-stage-shard position of the parallel configuration

              • a weighted edge e_{uv} — the amount of reusable model parameters and key/value cache when mapping GPU u to position v of the parallel configuration

            • Build a complete bipartite graph and compute the edge weight between every (u, v) pair using the size of their intersection contexts (a matching sketch follows this block)

            • If the new parallel configuration handles fewer concurrent inference requests

              • Discard part of the cached results → avoid exceeding the memory capacity of the new parallel configuration

              • Keep the batches of requests with more decoding progress
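A sketch of the matching step using SciPy's `linear_sum_assignment`, which implements the Kuhn-Munkres (Hungarian) algorithm; `plan_device_mapping` and the `reusable_bytes` matrix are illustrative, with the edge weight e_{uv} taken as the bytes of model-parameter and KV-cache context on GPU u that position v could reuse:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def plan_device_mapping(devices, positions, reusable_bytes):
    """Map GPU devices (V_a) to pipeline-stage-shard positions (V_t).

    reusable_bytes[i][j] is the edge weight e_uv: how many bytes of model
    parameters and KV cache already on device i could be reused if it were
    assigned to position j of the new parallel configuration.
    """
    weights = np.asarray(reusable_bytes, dtype=float)
    # Maximize total reuse, which equivalently minimizes the context that
    # must be transmitted between instances during migration.
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return {devices[i]: positions[j] for i, j in zip(rows, cols)}

# Example: 3 GPUs and 3 target (stage, shard) positions.
mapping = plan_device_mapping(
    ["g0", "g1", "g2"],
    [("stage0", "shard0"), ("stage1", "shard0"), ("stage2", "shard0")],
    [[0, 5, 0],
     [9, 0, 0],
     [0, 0, 7]],
)
```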

          • Migration Planner

            • Determine the exact migration plan to finish the configuration adjustment

            • Progressive migration schedule — utilize the pipeline structure and prioritize migrating the front model layers' context (a scheduling sketch follows this list)

            • Consider the memory usage during the progressive migration process
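A sketch of a front-to-back progressive schedule, assuming each context transfer is tagged with the pipeline stage of its destination position; `progressive_migration_plan` is an illustrative helper, and the real planner additionally tracks per-device memory usage while transfers are in flight:

```python
from collections import defaultdict

def progressive_migration_plan(transfers):
    """Order context transfers by destination pipeline stage, front stages first.

    transfers: iterable of (stage_index, src_device, dst_device, nbytes).
    Migrating the front model layers' context first lets early pipeline stages
    start computing on new requests while later stages are still receiving
    their context.
    """
    by_stage = defaultdict(list)
    for stage, src, dst, nbytes in transfers:
        by_stage[stage].append((src, dst, nbytes))
    return [(stage, by_stage[stage]) for stage in sorted(by_stage)]
```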

      • Instance Manager

        • Interacts with the cloud and receives instance preemption/acquisition notifications

        • Allocates on-demand and spot instances at the same time to avoid the waiting overhead when spot-instance allocation fails

        • Prefer to release on-demand instances

        • Keep a few additional instances (e.g., 2 in the experiments) to alleviate the impact of frequent disturbances in instance availability

  • Inference Engine

    • Deployed on each spot or on-demand GPU instance to serve LLM inference

    • Components

      • Context daemon

        • Manages the model parameters (i.e., model context) and intermediate activations (i.e., cache context) for different requests inside a certain GPU

      • Interruption Arranger

        • Support stateful inference recovery

  • Stateful inference recovery

    • Recover interrupted inference requests without recomputation (a recovery sketch follows this list)

    • Context daemon maintains the cache context of an inference request

    • Route the request to another inference pipeline using the cached state

    • Just-in-time arrangement

      • Each spot GPU instance includes an interruption arranger that receives a notification when a grace period starts

    • Fault tolerance

      • Delay newly acquired instances from joining so that the arrangements made for prior interruptions remain feasible

      • One instance gets preempted earlier than expected → Give up its cache context and only migrate the model context with the remaining instances

      • All replicas of the same piece of model context are lost due to unexpected failures → Restart by loading weights locally (e.g., from disk) or from remote cloud storage (e.g., S3) to fetch the required model parameters
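A sketch of how the grace period might be used, assuming the cache context (generated tokens plus KV cache) is committed after every decoding iteration; `InterruptionArranger` here is only a timing helper with hypothetical names, not the system's actual interface:

```python
import time

class InterruptionArranger:
    """Decide when a preempted instance must stop decoding and start migrating."""

    def __init__(self, grace_period_s: float, migration_estimate_s: float):
        # Constructed when the cloud's preemption notification arrives.
        self.deadline = time.monotonic() + grace_period_s
        self.migration_estimate_s = migration_estimate_s

    def should_stop(self) -> bool:
        # Stop once only enough of the grace period remains to migrate the context.
        return time.monotonic() + self.migration_estimate_s >= self.deadline

def serve_during_grace_period(arranger, decode_step, commit, migrate_context):
    """Keep decoding and committing per-token progress, then hand off the context.

    decode_step(): generate one more output token.
    commit(token): record the token and updated KV cache as the cache context.
    migrate_context(): ship the context so another pipeline can resume the
    request without recomputation.
    """
    while not arranger.should_stop():
        commit(decode_step())   # fine-grained (per-iteration) progress commit
    migrate_context()
```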

Implementation

Evaluation

  • Settings

    • A real 12-hour availability trace of AWS g4dn spot instances, from which two representative 20-minute segments with different dynamic behaviors are extracted

    • Two workloads

      • Stable inference request arrival workload

        • Different request arrival rates for different models

          • 1.5 requests/s for OPT-6.7B

          • 0.35 requests/s for GPT-20B

          • 0.2 requests/s for LLaMA-30B

        • Gamma request arrival process with a coefficient of variation of 6 (a generator sketch follows these settings)

      • Fluctuating inference request arrival workload

    • The maximum batch size B is selected from {1, 2, 4, 8}

    • S_in is 512 — the sequence length of the input tokens

    • S_out is 128 — the sequence length of the output tokens

  • Baselines

    • Rerouting — dynamically reroutes interrupted requests to other available pipelines when preemption happens

    • Reparallelization — restarts and reinitializes all instances without context migration

  • Metrics

    • The average and various tail latencies

    • Monetary cost — USD/token

Limitations and future work

  • Strongly rely on the grace period → Can explore more solutions to improve system performance (e.g., inference workload prediction, instance availability prediction)

  • Focus on single-type GPU instances → Can integrate heterogeneous spot instances or instances from different clouds

  • Take inference latency minimization as the optimization target → Can explore other targets (e.g., strict SLO, high throughput)

  • Can generalize to other preemptible resources (e.g., resource scheduler may preempt resources for urgent jobs with switching overheads)
