HexGen: Generative inference of foundation model over heterogeneous decentralized environment
Meta Info
Presented in arXiv:2311.11514.
Understanding the paper
TL;DR
Formally define the problem of concurrently serving multiple replicas of the same foundation model over a heterogeneous set of GPU devices as a constrained optimization problem
Each pipeline stage can use a different tensor model parallel degree
Propose a heuristic-based evolutionary algorithm to search for the optimal layout
HexGen: a distributed inference engine
Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting
Select a leader GPU node in a pipeline stage
Manage the peer-to-peer communication between pipeline stages
Manage the broadcast operation of the received activations within its tensor model parallel group
Formulation
A set of GPU devices, each characterized by:
GPU memory limit
GPU memory bandwidth
Tensor core computation power
The communication delay matrix between these devices
Each entry is the delay between a pair of devices
The communication bandwidth matrix between these devices
Each entry is the bandwidth between a pair of devices
The total number of layers in the model
An assignment
A collection of subsets of GPU devices
Each model replica is served as an independent pipeline
Each subset serves one stage of one pipeline
Each subset is allocated a number of Transformer layers
→ The devices within a subset run tensor model parallelism
An optimal assignment maximizes SLO attainment
s.t. the GPU memory limit of each device
The formulation is expressed in terms of:
The communication cost
The computation cost
Memory consumption for each device
Objective: find an optimal assignment that partitions the device set into multiple independent inference pipeline groups so as to maximize SLO attainment of the inference service, accounting for computation cost, communication cost, and memory consumption constraints
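The feasibility side of this formulation can be sketched in a few lines of Python. This is a minimal illustration, not HexGen's code: the `Device` and `Stage` classes, the `check_assignment` helper, and the even per-layer memory split (which ignores the KV cache and activations) are all assumptions made for the example, and the device names and the 1.75 GB per-layer figure are illustrative values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: float  # GPU memory limit (hypothetical field name)


@dataclass(frozen=True)
class Stage:
    devices: tuple   # names of the devices forming one tensor model parallel group
    num_layers: int  # Transformer layers served by this stage


def check_assignment(pipelines, devices, total_layers, layer_mem_gb):
    """Return a list of constraint violations (empty if the assignment is feasible).

    pipelines: list of pipelines (one per model replica), each a list of Stage.
    layer_mem_gb: assumed per-layer memory footprint, split evenly across a
        stage's tensor model parallel group (a simplification for this sketch).
    """
    by_name = {d.name: d for d in devices}
    problems, used = [], set()
    for p, stages in enumerate(pipelines):
        # Every pipeline must hold a full copy of the model.
        if sum(s.num_layers for s in stages) != total_layers:
            problems.append(f"pipeline {p}: layers do not sum to {total_layers}")
        for stage in stages:
            for name in stage.devices:
                # The subsets must be disjoint: one device, one stage.
                if name in used:
                    problems.append(f"device {name} is assigned more than once")
                used.add(name)
                # Memory consumption must stay within the device's limit.
                need = stage.num_layers * layer_mem_gb / len(stage.devices)
                if need > by_name[name].memory_gb:
                    problems.append(f"device {name}: needs {need:.1f} GB")
    return problems


# Illustrative heterogeneous pool: two 80 GB devices and four 24 GB devices.
pool = [Device("a0", 80), Device("a1", 80)] + [Device(f"b{i}", 24) for i in range(4)]

# One pipeline with asymmetric stages: TP degree 2 with 48 layers, TP degree 4 with 32 layers.
good = [[Stage(("a0", "a1"), 48), Stage(("b0", "b1", "b2", "b3"), 32)]]
# Same layer split, but the second stage is squeezed onto a single 24 GB device.
bad = [[Stage(("a0", "a1"), 48), Stage(("b0",), 32)]]

good_problems = check_assignment(good, pool, total_layers=80, layer_mem_gb=1.75)
bad_problems = check_assignment(bad, pool, total_layers=80, layer_mem_gb=1.75)
```

A search procedure would call a check like this on each candidate layout and only score the feasible ones; the scoring itself (communication and computation cost) is not modeled here.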
Implementation
Essential change: each pipeline parallel stage can be assigned a different number of layers and a different tensor model parallel degree
Steps
Each stage selects a leader GPU to initialize an independent tensor model parallel group
Only the leader node in each stage (i.e., tensor model parallel group) sends the activation to the leader GPU in the next stage
After receiving the activation, the leader GPU broadcasts this activation among its tensor model parallel group to execute the tensor model parallel computation
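The steps above can be traced with a small simulation. This is a sketch under stated assumptions rather than HexGen's implementation: `trace_communication` is a hypothetical helper, the first device listed in each stage is arbitrarily treated as its leader, and only the communication pattern is logged (no tensors are moved and no computation is run).

```python
def trace_communication(stages):
    """Return the ordered communication events for one forward pass.

    stages: list of lists of device names, one list per pipeline stage
        (i.e., per tensor model parallel group). The first name in each
        list is treated as the stage leader, an assumption of this sketch.
    """
    log = []
    for idx, group in enumerate(stages):
        leader, peers = group[0], group[1:]
        if idx > 0:
            # Only leaders talk across stages: previous leader -> this leader.
            log.append(("send", stages[idx - 1][0], leader))
        for peer in peers:
            # The leader shares the received activation within its own group.
            log.append(("broadcast", leader, peer))
    return log


# Asymmetric layout: tensor model parallel degrees 4, 2, and 1.
layout = [["a0", "a1", "a2", "a3"], ["b0", "b1"], ["c0"]]
events = trace_communication(layout)
sends = [e for e in events if e[0] == "send"]
```

With this layout there is exactly one peer-to-peer transfer per stage boundary regardless of how the tensor model parallel degrees differ, which is what lets adjacent stages use groups of different sizes.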
Evaluation
Compared to Petals
Petals relies on dynamic adjustment of the collective learning paradigm to ensure elasticity → this dynamic design compromises inference service performance
HexGen carefully designs static scheduling of the inference workflow
Metrics
SLO attainment
Generate the inference workload according to a Poisson process parameterized by the request rate
For a target SLO goal (e.g., 99%)
The minimum latency deadline to achieve the desired attainment
The system’s resilience to peak request rate
Llama 2 70B model
Real-world prompts: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
Output sequence length: 32, 64, 128
Request rates vary from 0.125 to 10 requests per second
→ the default SLO is set as tight as inference latency
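The SLO attainment metric described above can be illustrated with a toy simulation. This is a hedged sketch, not the paper's evaluation harness: all function names are hypothetical, and the single first-come-first-served server with a fixed service time is an assumption standing in for a real inference pipeline.

```python
import math
import random


def poisson_arrivals(rate, n, seed=0):
    """Arrival times of n requests from a Poisson process with the given rate."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # exponential inter-arrival gaps
        arrivals.append(t)
    return arrivals


def fifo_latencies(arrivals, service_time):
    """End-to-end latency per request on one FIFO server (toy serving model)."""
    latencies, free_at = [], 0.0
    for a in arrivals:
        start = max(a, free_at)       # wait if the server is still busy
        free_at = start + service_time
        latencies.append(free_at - a)
    return latencies


def slo_attainment(latencies, deadline):
    """Fraction of requests that finish within the latency deadline."""
    return sum(l <= deadline for l in latencies) / len(latencies)


def min_deadline(latencies, target):
    """Smallest deadline whose attainment reaches the target (e.g., 0.99)."""
    ranked = sorted(latencies)
    k = max(math.ceil(target * len(ranked)), 1)
    return ranked[k - 1]


# Example: 1 request per second on average, 0.5 s of service per request.
lat = fifo_latencies(poisson_arrivals(rate=1.0, n=1000), service_time=0.5)
deadline_99 = min_deadline(lat, 0.99)
```

Sweeping `rate` while holding the deadline fixed gives the other view of the metric: the highest request rate at which attainment still meets the target.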