SpotServe: Serving generative large language models on preemptible instances
Meta Info
Presented in ASPLOS 2024.
Understanding the paper
TL;DR
SpotServe — the first distributed LLM serving system on preemptible instances
Techniques
Dynamically adapt the LLM parallelization configuration
Minimize the cost of migrating instances for dynamic reparallelization
Formulated as a bipartite graph matching problem → use the Kuhn-Munkres (KM) algorithm to identify an optimal migration plan
Stateful inference recovery
Commit inference progress at a much finer granularity
Resume inference upon preemption
Background
Spot instances
Lower price than on-demand instances
May be preempted at any time
Grace period (e.g., 30 seconds for AWS spot instances); see the interruption-notice polling sketch below
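For background, a minimal sketch of how an instance-side daemon could detect the start of the grace period on AWS by polling the EC2 spot interruption notice in the instance metadata service (IMDSv1-style access is assumed; the polling interval and error handling are illustrative, not from the paper):

```python
import json
import time
import urllib.request

# EC2 instance metadata endpoint that exposes the spot interruption notice.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_grace_period(poll_interval_s: float = 2.0) -> dict:
    """Block until the cloud schedules a preemption, then return the notice.

    The endpoint returns 404 until an interruption is scheduled; afterwards it
    returns JSON such as {"action": "terminate", "time": "..."}. The time left
    until that timestamp is the grace period available for context migration.
    """
    while True:
        try:
            with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
                return json.loads(resp.read().decode())
        except OSError:
            # Covers 404 (no interruption scheduled yet) and transient errors.
            time.sleep(poll_interval_s)
```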
Existing works
Challenges in serving LLMs on spot GPU instances
Dynamic reparallelization — how to quickly adapt to changes to spot instances’ availability and requests’ arrival rates?
Instance migration — how to minimize the cost of migrating GPU instances for reparallelization?
Grace period — how to leverage the grace period to handle unfinished requests?
Designs
Inference Server
Deployed on a dedicated on-demand CPU instance
Three components
Request Manager
Receive input requests
Dynamically partition them into batches
Assign these batches to inference instances running on spot GPU instances (see the dispatch sketch after this list)
Collect generated outputs from the inference instances
Send the results back to users
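A minimal sketch of the partition-and-assign step described above (the round-robin assignment and the `submit` method are illustrative assumptions, not the paper's actual policy):

```python
from collections import deque

def dispatch_requests(pending: deque, pipelines: list, max_batch_size: int):
    """Partition pending requests into batches and hand them to inference pipelines.

    `pending` is a deque of request objects; each pipeline is assumed to expose a
    hypothetical submit(batch) method. Assignment here is simple round-robin.
    """
    assignments = []
    i = 0
    while pending:
        batch = [pending.popleft()
                 for _ in range(min(max_batch_size, len(pending)))]
        pipeline = pipelines[i % len(pipelines)]
        pipeline.submit(batch)
        assignments.append((pipeline, batch))
        i += 1
    return assignments
```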
Meta-context Manager
Manage the adjustment of the parallel configuration by sending instructions for context migration to all GPU instances
Modules
Parallelization Controller
Adjust the parallelization configuration to improve LLM serving performance
A parallel configuration consists of (see the sketch after this list)
data parallelism degree
pipeline-model parallelism degree
tensor-model parallelism degree
the maximum mini-batch size
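A minimal sketch of how such a configuration tuple might be represented (the field names are illustrative, not the paper's notation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelConfig:
    """One candidate parallelization configuration for serving an LLM."""
    dp: int         # data parallelism degree
    pp: int         # pipeline-model parallelism degree
    tp: int         # tensor-model parallelism degree
    max_batch: int  # maximum mini-batch size

    @property
    def num_gpus(self) -> int:
        # Total GPUs required to instantiate this configuration.
        return self.dp * self.pp * self.tp
```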
Measure the initialization time in advance
Adaptive optimization algorithm
Two variables at each time step
the parallel configuration
the number of available instances
Include newly allocated instances
Exclude instances to be preempted
Minimize the end-to-end inference latency while maintaining a throughput higher than the request arrival rate
If multiple configurations can achieve similar minimum inference latency → Select the configuration with lower monetary cost (i.e., using fewer instances)
If the peak serving throughput cannot exceed the request arrival rate → Maximize the overall serving throughput
Optionally allocate on-demand instances to improve serving throughput
The online algorithm runs with negligible overhead (i.e., less than 1 s)
The latency of different configurations is estimated offline in advance (a configuration-selection sketch follows this list)
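A sketch of the selection logic above, assuming the offline-profiled estimates are exposed as callables; `estimate_latency`, `estimate_throughput`, `cost_per_hour`, and the tolerance `latency_tol` are hypothetical names, and candidates are `ParallelConfig`-like objects exposing `num_gpus`:

```python
def choose_config(candidates, num_instances, arrival_rate,
                  estimate_latency, estimate_throughput, cost_per_hour,
                  latency_tol=0.05):
    """Pick a parallel configuration for the currently available instances.

    Minimize end-to-end latency subject to throughput >= request arrival rate;
    among configurations with similar minimum latency, prefer the cheaper one;
    if no configuration can keep up with the arrival rate, maximize throughput.
    """
    usable = [c for c in candidates if c.num_gpus <= num_instances]
    if not usable:
        return None  # no configuration fits the available instances
    feasible = [c for c in usable if estimate_throughput(c) >= arrival_rate]
    if not feasible:
        # Peak throughput cannot match the arrival rate: maximize throughput.
        return max(usable, key=estimate_throughput)
    best = min(estimate_latency(c) for c in feasible)
    near_best = [c for c in feasible
                 if estimate_latency(c) <= best * (1 + latency_tol)]
    # Tie-break toward lower monetary cost (fewer / cheaper instances).
    return min(near_best, key=cost_per_hour)
```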
Device Mapper
Use the Kuhn-Munkres (KM) algorithm to find an optimal device mapping → Maximally reuse the model parameters and KV cache on available GPU instances & minimize the total data transmission
Device mapping — a bipartite graph
a vertex u — a GPU device
a vertex v — a pipeline-stage-shard position of the parallel configuration
a weighted edge (u, v) — the amount of reusable model parameters and key/value cache when mapping GPU u to position v of the parallel configuration
Build a complete bipartite graph and compute the edge weight between every pair as the size of the intersection of their contexts
If the new parallel configuration handles fewer concurrent inference requests
Discard part of the cached results → avoid exceeding the memory capacity of the new parallel configuration
Keep the batches of requests with the most decoding progress (a mapping sketch follows this list)
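A sketch of the device-mapping step using SciPy's `linear_sum_assignment` (a Hungarian/Kuhn-Munkres assignment solver) on a hypothetical `reusable_bytes(gpu, position)` weight function:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def plan_device_mapping(gpus, positions, reusable_bytes):
    """Map available GPUs to pipeline-stage-shard positions of the new configuration.

    The edge weight between GPU u and position v is the amount of model
    parameters and key/value cache already on u that position v can reuse;
    maximizing the total weight minimizes the data transmitted during migration.
    """
    weights = np.array([[reusable_bytes(g, p) for p in positions] for g in gpus],
                       dtype=float)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return {gpus[r]: positions[c] for r, c in zip(rows, cols)}
```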
Migration Planner
Determine the exact migration plan to finish the configuration adjustment
Progressive migration schedule — utilize the pipeline structure and prioritize the migration of front model layers’ context
Consider the memory usage during the progressive migration process (see the sketch below)
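A sketch of a progressive migration schedule under these constraints; the per-stage block structure and the memory budget are illustrative assumptions, not the paper's data layout:

```python
def progressive_migration_plan(context_blocks, mem_budget_bytes):
    """Yield batches of context transfers, front pipeline stages first.

    `context_blocks` maps stage index -> list of (size_bytes, block_id) tuples to
    move. Emitting transfers stage by stage lets the front of the new pipeline
    start working on buffered requests before the whole migration completes,
    while the per-batch budget bounds memory held by in-flight transfers.
    """
    for stage in sorted(context_blocks):
        batch, in_flight = [], 0
        for size, block_id in context_blocks[stage]:
            if batch and in_flight + size > mem_budget_bytes:
                yield stage, batch
                batch, in_flight = [], 0
            batch.append(block_id)
            in_flight += size
        if batch:
            yield stage, batch
```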
Instance Manager
Interacts with the cloud and receives instance preemption/acquisition notifications
Allocates on-demand and spot instances at the same time to avoid the waiting overhead when spot-instance allocation fails
Prefer to release on-demand instances
Keep a few additional instances (e.g., 2 in the experiments) to alleviate the impact of frequent changes in instance availability (see the release-policy sketch below)
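A sketch of the release policy described above; the `is_on_demand` flag and the buffer size are illustrative assumptions:

```python
def plan_release(active_instances, required, extra=2):
    """Pick instances to release after a configuration change.

    Keeps the instances needed by the new parallel configuration plus a small
    buffer (e.g., 2) to absorb availability churn, and releases on-demand
    instances before spot instances since they are more expensive.
    """
    keep = required + extra
    num_to_release = max(0, len(active_instances) - keep)
    # Sort on-demand instances first so they are released first.
    ordered = sorted(active_instances, key=lambda inst: not inst.is_on_demand)
    return ordered[:num_to_release]
```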
Inference Engine
Deployed on each spot or on-demand GPU instance to serve LLM inference
Components
Context daemon
Manages the model parameters (i.e., model context) and intermediate activations (i.e., cache context) for different requests inside a certain GPU
Interruption Arranger
Support stateful inference recovery
Stateful inference recovery
Recover interrupted inference requests without recomputation
The context daemon maintains the cache context of an inference request
Route the request to another inference pipeline using the cached state (a recovery sketch follows this block)
Just-in-time arrangement
Each spot GPU instance includes an interruption arranger that receives a notification when a grace period starts
Fault tolerance
Delay newly acquired instances from joining so that the arrangements made for prior interruptions remain feasible
An instance is preempted earlier than expected → Give up its cache context and only migrate the model context using the remaining instances
All replicas of the same piece of model context are lost due to unexpected failures → Restart by loading weights locally (e.g., disk) or from remote cloud storage (e.g., S3) to fetch the required model parameters
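A sketch of per-token progress commitment and rerouting; `decode_step`, `eos_token_id`, and the cache layout are hypothetical stand-ins for the engine's actual interfaces:

```python
class CacheContext:
    """Decoding state of one request, committed token by token (illustrative)."""
    def __init__(self, request_id, prompt_tokens):
        self.request_id = request_id
        self.prompt_len = len(prompt_tokens)
        self.tokens = list(prompt_tokens)  # prompt + all committed output tokens
        self.kv_cache = None               # migrated key/value cache (placeholder)

def resume_request(ctx, pipeline, max_new_tokens, eos_token_id):
    """Continue an interrupted request on another pipeline without recomputation.

    `pipeline.decode_step(tokens, kv_cache)` is a hypothetical call that runs one
    decoding step reusing the migrated KV cache. Because progress is committed at
    token granularity, generation resumes from the last committed token rather
    than restarting from the prompt.
    """
    while len(ctx.tokens) - ctx.prompt_len < max_new_tokens:
        next_token, ctx.kv_cache = pipeline.decode_step(ctx.tokens, ctx.kv_cache)
        ctx.tokens.append(next_token)  # commit progress one token at a time
        if next_token == eos_token_id:
            break
    return ctx.tokens[ctx.prompt_len:]
```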
Implementation
5.6K LoC in C++ and 2.2K LoC in Python
Built on top of FasterTransformer
Evaluation
Settings
A real 12-hour availability trace of AWS g4dn spot instances; two representative 20-minute segments with different dynamic behaviors are extracted from it
Two workloads
Stable inference request arrival workload
Different request arrival rates for different models
1.5 requests/s for OPT-6.7B
0.35 requests/s for GPT-20B
0.2 requests/s for LLaMA-30B
Gamma request arrival process with a coefficient of variation (CV) of 6
Fluctuating inference request arrival workload
Trace: Serverless-in-the-wild
The maximum batch size is selected from a predefined set of candidate values
The input sequence length is 512 tokens
The output sequence length is 128 tokens
Baselines
Rerouting — dynamically reroutes interrupted requests to other available pipelines when preemption happens
Reparallelization — restart and reinitialize all instances without context migration
Metrics
The average and various tail latencies
Monetary cost — USD per token (see the sketch below)
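A minimal sketch of how the USD-per-token metric can be computed from billing and serving logs (all names are illustrative):

```python
def usd_per_token(instance_hours, hourly_price_usd, total_output_tokens):
    """Monetary cost normalized by the number of generated tokens.

    `instance_hours` maps an instance type to the hours billed for it, and
    `hourly_price_usd` maps the same types to their hourly prices.
    """
    total_usd = sum(hours * hourly_price_usd[t]
                    for t, hours in instance_hours.items())
    return total_usd / total_output_tokens
```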
Limitations and future work
Strongly rely on the grace period → Can explore more solutions to improve system performance (e.g., inference workload prediction, instance availability prediction)
Focus on single-type GPU instances → Can integrate heterogeneous spot instances or instances from different clouds
Take inference latency minimization as the optimization target → Can explore other targets (e.g., strict SLO, high throughput)
Can generalize to other preemptible resources (e.g., resource scheduler may preempt resources for urgent jobs with switching overheads)