SpotServe: Serving generative large language models on preemptible instances

Meta Info

Presented in ASPLOS 2024.

Understanding the paper

TL;DR

  • SpotServe — the first distributed LLM serving system on preemptible instances

  • Techniques

    • Dynamically adapt the LLM parallelization configuration

    • Minimize the cost of migrating instances for dynamic reparallelization

      • Formulated as a bipartite graph matching problem → use the Kuhn-Munkres algorithm to identify an optimal migration plan

    • Stateful inference recovery

      • Commit inference progress at a much finer granularity

      • Resume inference upon preemption

Background

  • Spot instances

    • Lower price than on-demand instances

    • May be preempted at any time

    • Grace period (e.g., 30 seconds for AWS spot instances)

Existing works

  • Leverage spot instances to reduce the monetary cost of DNN inference

    • Example: MArk, Cocktail

    • Limitation: Target small DNN models that fit on a single spot instance with one or multiple GPUs

    • Handle preemptions using request rerouting or redundant computation

Challenges in serving LLMs on spot GPU instances

  • Dynamic reparallelization — how to quickly adapt to changes in spot instance availability and request arrival rates?

  • Instance migration — how to minimize the cost of migrating GPU instances for reparallelization?

  • Grace period — how to leverage the grace period to handle unfinished requests?

Designs

  • Inference Server

    • Deployed on a dedicated on-demand CPU instance

    • Three components

      • Request Manager

        • Receive input requests

        • Dynamically partition them into batches

        • Assign these batches to inference instances running on spot GPU instances

        • Collect generated outputs from the inference instances

        • Send the results back to users

      • Meta-context Manager

        • Manage the adjustment of the parallel configuration by sending instructions for context migration to all GPU instances

        • Modules

          • Parallelization Controller

            • Adjust the parallelization configuration to improve LLM serving performance

            • A parallel configuration — $C = (D, P, M, B)$

              • $D$ — data parallelism degree

              • $P$ — pipeline-model parallelism degree

              • $M$ — tensor-model parallelism degree

              • $B$ — the maximum mini-batch size

            • Measure the initialization time in advance

            • Adaptive optimization algorithm

              • Two variables $C_t$ and $N_t$ at time step $t$

                • $C_t$ — the parallel configuration

                • $N_t$ — the number of available instances

                  • Include newly allocated instances

                  • Exclude instances to be preempted

              • Minimize the end-to-end inference latency $l_{req}(C)$ while maintaining a serving throughput no lower than the request arrival rate $\alpha_t$

              • If multiple configurations can achieve similar minimum inference latency → Select the configuration with lower monetary cost (i.e., using fewer instances)

              • If the peak serving throughput cannot reach the request arrival rate $\alpha_t$ → Maximize the overall serving throughput

              • Optionally allocate on-demand instances to improve serving throughput

              • Run as an online algorithm with negligible overhead (i.e., less than 1 s)

              • Estimate the latency of different configurations offline in advance (a configuration-search sketch follows this module)
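
A minimal sketch of how this configuration search could look, assuming offline-profiled latency and throughput tables keyed by configuration; the function name `choose_config` and the table names are illustrative assumptions, not SpotServe's actual interfaces:

```python
# Illustrative configuration search over C = (D, P, M, B), assuming
# offline-profiled lookup tables keyed by (D, P, M, B).
from itertools import product

def choose_config(n_available, alpha_t, profiled_latency, profiled_throughput,
                  batch_sizes=(1, 2, 4, 8)):
    """Pick C = (D, P, M, B) for N_t available GPUs and arrival rate alpha_t."""
    candidates = [
        (d, p, m, b)
        for d, p, m in product(range(1, n_available + 1), repeat=3)
        if d * p * m <= n_available          # the configuration must fit on N_t GPUs
        for b in batch_sizes
        if (d, p, m, b) in profiled_latency  # only consider profiled configurations
    ]
    feasible = [c for c in candidates if profiled_throughput[c] >= alpha_t]
    if feasible:
        # Minimize end-to-end latency; break ties with fewer instances (lower cost).
        return min(feasible, key=lambda c: (profiled_latency[c], c[0] * c[1] * c[2]))
    # Peak throughput cannot reach alpha_t: maximize the overall serving throughput.
    return max(candidates, key=lambda c: profiled_throughput[c])
```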

          • Device Mapper

            • Use the Kuhn-Munkres (KM) algorithm to find an optimal device mapping → Maximally reuse the model parameters and KV cache on available GPU instances & minimize the total data transmission

            • Device mapping — a bipartite graph $\mathcal{G} = (\mathcal{V}_a, \mathcal{V}_t, \varepsilon)$

              • $u \in \mathcal{V}_a$ — a GPU device

              • $v \in \mathcal{V}_t$ — a pipeline-stage-shard position of the parallel configuration

              • a weighted edge $e_{uv}$ — the amount of reusable model parameters and key/value cache when mapping GPU $u$ to position $v$ of the parallel configuration

            • Build a complete bipartite graph and compute the edge weight between every $(u, v)$ pair using the size of their intersecting contexts (see the matching sketch after this module)

            • If the new parallel configuration handles fewer concurrent inference requests

              • Discard part of the cached results → avoid exceeding the memory capacity of the new parallel configuration

              • Keep the batches of requests with more decoding progress
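
A minimal sketch of the device-mapping step, assuming a precomputed matrix `reuse[u][v]` of reusable bytes; SciPy's `linear_sum_assignment` (a Hungarian/Kuhn-Munkres solver) stands in for whatever KM implementation SpotServe uses:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_devices(reuse: np.ndarray) -> dict:
    """reuse[u][v]: bytes of model parameters + KV cache GPU u could keep if it
    were assigned to pipeline-stage-shard position v. Returns {gpu: position}."""
    # linear_sum_assignment minimizes total cost, so negate the weights to
    # maximize the total reusable context (i.e., minimize data transmission).
    rows, cols = linear_sum_assignment(-reuse)
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy example with 3 GPUs and 3 target positions (illustrative byte counts).
reuse = np.array([[10, 0, 2],
                  [ 1, 8, 0],
                  [ 0, 3, 9]])
print(map_devices(reuse))   # {0: 0, 1: 1, 2: 2}
```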

          • Migration Planner

            • Determine the exact migration plan to finish the configuration adjustment

            • Progressive migration schedule — utilize the pipeline structure and prioritize the migration of front model layers’ context

            • Consider the memory usage during the progressive migration process (a scheduling sketch follows)
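
An illustrative sketch of such a progressive schedule: layer contexts are migrated front-to-back and batched under a per-step memory budget. The `plan_migration` function, its data layout, and the `mem_budget` parameter are assumptions for illustration only:

```python
def plan_migration(transfers, mem_budget):
    """transfers: list of (layer_index, num_bytes) context pieces to move.
    Returns a list of migration steps, each a list of pieces sent together."""
    steps, current, used = [], [], 0
    # Front layers go first so early pipeline stages become ready (and can
    # start serving new requests) before the whole migration finishes.
    for layer, size in sorted(transfers, key=lambda t: t[0]):
        if current and used + size > mem_budget:
            steps.append(current)      # flush before exceeding the memory budget
            current, used = [], 0
        current.append((layer, size))
        used += size
    if current:
        steps.append(current)
    return steps
```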

      • Instance Manager

        • Interacts with the cloud and receives instance preemption/acquisition notifications

        • Allocates on-demand and spot instances at the same time to avoid the waiting overhead when spot-instance allocation fails

        • Prefers to release on-demand instances first

        • Keeps a few additional instances (e.g., 2 in the experiments) to absorb frequent fluctuations in instance availability (a policy sketch follows)
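
A hedged sketch of this allocation policy; the `InstanceManager` class and the `cloud` interface it calls are made up for illustration and are not SpotServe's implementation:

```python
class InstanceManager:
    """Toy policy object; `cloud` is a hypothetical allocation interface."""

    def __init__(self, slack=2):
        self.spot, self.on_demand = [], []
        self.slack = slack             # a few extra instances kept as a buffer

    def acquire(self, needed, cloud):
        # Request spot and on-demand capacity at the same time so that a failed
        # spot allocation does not stall the configuration adjustment.
        target = needed + self.slack
        new_spot = cloud.allocate_spot(target)                       # hypothetical API
        new_on_demand = cloud.allocate_on_demand(max(0, target - len(new_spot)))
        self.spot += new_spot
        self.on_demand += new_on_demand

    def release(self, surplus):
        # Prefer releasing on-demand instances first to cut monetary cost.
        while surplus > 0 and self.on_demand:
            self.on_demand.pop()
            surplus -= 1
        while surplus > 0 and self.spot:
            self.spot.pop()
            surplus -= 1
```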

  • Inference Engine

    • Deployed on each spot or on-demand GPU instance to serve LLM inference

    • Components

      • Context daemon

        • Manages the model parameters (i.e., model context) and intermediate activations (i.e., cache context) for different requests inside a certain GPU

      • Interruption Arranger

        • Support stateful inference recovery

  • Stateful inference recovery

    • Recover interrupted inference request without recomputation

    • Context daemon maintains the cache context of an inference request

    • Route the request to another inference pipeline using the cached state

    • Just-in-time arrangement

      • Each spot GPU instance includes an interruption arranger that receives a notification when a grace period starts

    • Fault tolerance

      • Delay newly acquired instances from joining so that the arrangements made for prior interruptions remain feasible

      • If an instance is preempted earlier than expected → Give up its cache context and migrate only the model context using the remaining instances

      • If all replicas of the same piece of model context are lost due to unexpected failures → Restart by loading the required model parameters from local disk or remote cloud storage (e.g., S3) (a recovery sketch follows this list)
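
A simplified sketch of these recovery decisions, assuming an in-memory view of the model context (weights) and cache context (KV cache plus committed decoding progress); all names here are illustrative:

```python
def recover(preempted_early, cache_context, model_replicas, load_weights):
    """Decide how an interrupted request resumes on the new configuration."""
    if preempted_early:
        # The instance vanished before the grace period ended: give up the
        # cache context and migrate only the model context on the survivors.
        cache_context = None
    if not model_replicas:
        # Every replica of this model-context shard is lost: reload the weights
        # from local disk or remote cloud storage (e.g., S3).
        model_replicas = [load_weights()]
    # Resume decoding from the last committed token (progress is committed at a
    # fine granularity); with no cache context the request restarts from its prompt.
    resume_token = cache_context["committed_tokens"] if cache_context else 0
    return model_replicas, resume_token
```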

Implementation

Evaluation

  • Settings

    • A real 12-hour availability trace of AWS g4dn spot instances, from which two representative 20-minute segments with different dynamic behaviors are extracted

    • Two workloads

      • Stable inference request arrival workload

        • Different request arrival rates for different models

          • 1.5 requests/s for OPT-6.7B

          • 0.35 requests/s for GPT-20B

          • 0.2 requests/s for LLaMA-30B

        • Gamma request arrival process with a coefficient of variation of 6 (a trace-generation sketch follows)
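
A small sketch of generating such an arrival trace: for a Gamma distribution the coefficient of variation equals 1/sqrt(shape), so CV = 6 corresponds to shape = 1/36, and the scale is set so the mean inter-arrival time matches the target rate. The function below is illustrative, not the paper's workload generator:

```python
import numpy as np

def gamma_arrivals(rate, cv, num_requests, seed=0):
    """Return request arrival timestamps (seconds) for a Gamma arrival process."""
    shape = 1.0 / (cv ** 2)            # CV = 1/sqrt(shape)  =>  shape = 1/CV^2
    scale = (1.0 / rate) / shape       # mean inter-arrival = shape * scale = 1/rate
    rng = np.random.default_rng(seed)
    inter_arrivals = rng.gamma(shape, scale, size=num_requests)
    return np.cumsum(inter_arrivals)

# e.g., the OPT-6.7B setting: 1.5 requests/s with CV = 6
timestamps = gamma_arrivals(rate=1.5, cv=6, num_requests=1000)
```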

      • Fluctuating inference request arrival workload

    • The maximum batch size $B$ is selected from $\{1, 2, 4, 8\}$

    • $S_{in}$ is 512 — the sequence length of input tokens

    • $S_{out}$ is 128 — the sequence length of output tokens

  • Baselines

    • Rerouting — dynamically reroutes interrupted requests to other available pipelines when preemption happens

    • Reparallelization — restart and reinitialize all instances without context migration

  • Metrics

    • The average and various tail latencies

    • Monetary cost — USD/token

Limitations and future work

  • Strongly rely on the grace period → Can explore more solutions to improve system performance (e.g., inference workload prediction, instance availability prediction)

  • Focus on single-type GPU instances → Can integrate heterogeneous spot instances or instances from different clouds

  • Take inference latency minimization as the optimization target → Can explore other targets (e.g., strict SLO, high throughput)

  • Can generalize to other preemptible resources (e.g., resource scheduler may preempt resources for urgent jobs with switching overheads)
