# SpotServe: Serving generative large language models on preemptible instances

## Meta Info

Presented in [ASPLOS 2024](https://arxiv.org/abs/2311.15566).

## Understanding the paper

### TL;DR

* SpotServe — the first distributed LLM serving system on *preemptible instances*
* Techniques
  * Dynamically adapt the LLM parallelization configuration
  * Minimize the cost of migrating instances for dynamic reparallelization
    * Formulated as a bipartite graph matching problem → use the KuhnMunkres algorithm to identify an optimal migration plan
  * Stateful inference recovery
    * Commit inference progress at a much finer granularity
    * Resume inference upon preemption

### Background

* Spot instances
  * Lower price than on-demand instances
  * May be preempted at any time
  * Grace period (e.g., 30 seconds for AWS spot instances)

### Existing works

* Leverage spot instances to reduce the monetary cost of DNN inference
  * Example: [MArk](https://www.usenix.org/conference/atc19/presentation/zhang-chengliang), [Cocktail](https://www.usenix.org/conference/nsdi22/presentation/gunasekaran)
  * Limitation: Target *small* DNN models that can *fit on a single spot instance* with one or multiple GPU
  * Handle preemptions using *request rerouting* or *redundant computation*

### Challenges in serving LLMs on spot GPU instances

* Dynamic reparallelization — how to quickly adapt to changes to spot instances’ availability and requests’ arrival rates?
* Instance migration — how to minimize the cost of migrating GPU instances for reparallelization?
* Grace period — how to leverage grace period to handle unfinished request?

### Designs

<figure><img src="/files/vJFCL6B62RnVE1AcRais" alt=""><figcaption><p>An overview of SpotServe</p></figcaption></figure>

* **Inference Server**
  * Deployed on a dedicated *on-demand CPU instance*
  * Three components
    * **Request Manager**
      * Receiving input requests
      * Dynamically partition them into batches
      * Assign these batches to inference instances running on spot GPU instances
      * Collect generated outputs from the inference instances
      * Send the results back to users
    * **Meta-context Manager**
      * Manage the adjustment of the parallel configuration by sending instructions for context migration to all GPU instances
      * Modules
        * **Parallelization Controller**
          * Adjust the parallelization configuration to improve LLM serving performance
          * A parallel configuration — $$C = (D, P, M, B)$$
            * $$D$$ — data parallelism degree
            * $$P$$ — pipeline-model parallelism degree
            * $$M$$ — tensor-model parallelism degree
            * $$B$$ — the maximum mini-batch size
          * Measure the initialization time in advance
          * *Adaptive optimization algorithm*
            * Two variables $$C\_t$$ and at time step $$t$$
              * $$C\_t$$ — parallel configuration
              * $$N\_t$$ — the number of available instances
                * Include newly allocated instances
                * Exclude instances to be preempted
            * Minimizes the end-to-end inference latency $$l\_{req}(C)$$ while maintaining a throughput higher than $$\alpha\_𝑡$$
            * If multiple configurations can achieve similar minimum inference latency → Select the configuration with *lower monetary cost* (i.e., using fewer instances)
            * If peak serving throughput can not exceed the request arrival rate $$\alpha\_𝑡$$ → Maximize the overall serving throughput
            * *Optionally* allocate on-demand instances to improve serving throughput
            * Run the online algorithm, negligible overhead (i.e., less than 1s)
            * Offline estimate the latency of different configurations in advance
        * **Device Mapper**
          * Use the Kuhn-Munkres (KM) algorithm to find an optimal device mapping → Maximally reuse the model parameters and KV cache on available GPU instances & minimize the total data transmission
          * Device mapping — a bipartite graph $$\mathcal{G} = (\mathcal{V\_a}, \mathcal{V\_t}, \varepsilon)$$
            * $$u \in \mathcal{V\_a}$$ — a GPU device
            * $$v \in \mathcal{V\_t}$$ — a pipeline-stage-shard position of the parallel configuration
            * a weighted edge $$𝑒\_{𝑢𝑣}$$ — the amount of reusable model parameters and key/value cache when mapping GPU 𝑢 to position 𝑣 of the parallel configuration
          * Build a complete bipartite graph and compute the edge weight between every $$(𝑢, 𝑣)$$ pair using the size of their intersection contexts
          * If the new parallel configuration handles less concurrent inference requests
            * Discard part of the cached results → avoid exceeding the memory capacity of the new parallel configuration
            * Keep the batches of requests with more decoding progresses
        * **Migration Planner**
          * Determine the exact *migration plan* to finish the configuration adjustment
          * *Progressive migration* schedule — utilize the pipeline structure and prioritize the migration of front model layers’ context
          * Consider the memory usage during the progressive migration process
    * **Instance Manager**
      * Interacts with the cloud and receives instance preemption/acquisition notifications
      * *Allocates on-demand and spot instances at the same time* to avoid the waiting overhead when spot-instance allocation fails
      * Prefer to release on-demand instances
      * *Keep few additional instances* (e.g., 2 in experiments) to alleviate the impacts of frequent disturbance of instance availability
* **Inference Engine**
  * Deployed on each spot or on-demand GPU instance to serve LLM inference
  * Components
    * **Context daemon**
      * Manages the model parameters (i.e., *model context*) and intermediate activations (i.e., *cache context*) for different requests inside a certain GPU
    * **Interruption Arranger**
      * Support stateful inference recovery
* Stateful inference recovery
  * Recover interrupted inference request without recomputation
  * Context daemon maintains the cache context of an inference request
  * Route the request to another inference pipeline using the cached state
  * Just-in-time arrangement
    * Each spot GPU instance includes *an interruption arranger* that receives a notification when a grace period starts
  * Fault tolerance
    * Delay the acquired instance joining and make the arrangements for prior interruptions feasible
    * One instance gets preempted before expected → Give up the cache context and only migrate the model context with the rest instances
    * All replicas of the same piece of model context are lost due to unexpected failures → Restart by loading weights locally (e.g., disk) or from remote cloud storage (e.g., S3) to fetch the required model parameters

### Implementation

* 5.6K LoC in C++ and 2.2 LoC in Python
* Built on top of [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/)

### Evaluation

* Settings
  * A real *12-hour availability trace* with *AWS g4dn spot instance* and extract *two representative 20-minute segments* with different dynamic behaviors
  * Two workloads
    * Stable inference request arrival workload
      * Different request arrival rates for different models
        * 1.5 requests/s for OPT-6.7B
        * 0.35 requests/s for GPT-20B
        * 0.2 requests/s for LLaMA-30B
      * *Gamma request arrival process* with a coefficient of variance of 6
    * Fluctuating inference request arrival workload
      * Trace: [Serverless-in-the-wild](https://www.usenix.org/conference/atc20/presentation/shahrad)
  * The maximum batch size $$B$$ is selected from $${1,2,4,8}$$
  * $$𝑆\_{𝑖𝑛}$$ is 512 — the sequence length of the input tokens
  * $$𝑆\_{𝑜𝑢𝑡}$$ is 128 — the sequence length of output tokens
* Baselines
  * **Rerouting** — dynamically reroutes interrupted requests to other available pipelines when preemption happens
  * **Reparallelization** — restart and reinitialize all instances without context migration
* Metrics
  * The average and various tail latencies
  * Monetary cost — USD/token

### Limitations and future work

* Strongly rely on the grace period → Can explore more solutions to improve system performance (e.g., inference workload prediction, instance availability prediction)
* Focus on single-type GPU instances → Can integrate heterogeneous spot instances or instances from different clouds
* Take inference latency minimization as the optimization target → Can explore other targets (e.g., strict SLO, high throughput)
* Can generalize to other preemptible resources (e.g., resource scheduler may preempt resources for urgent jobs with switching overheads)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://paper.lingyunyang.com/reading-notes/conference/asplos-2024/spotserve.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
