# HexGen: Generative inference of foundation model over heterogeneous decentralized environment

## Meta Info

Presented in [arxiv:2311.11514](https://arxiv.org/abs/2311.11514).

## Understanding the paper

### TL;DR

* Formally define the scheduling of serving the inference of multiple copies of the same foundation model concurrently over a heterogeneous set of GPU devices as *a constrained optimization problem*
  * Each pipeline stage can consider a different tensor model parallel degree
  * Propose *a heuristic-based evolutionary algorithm* to search for the optimal layout
* HexGen — a distributed inference engine
  * Support asymmetric tensor model parallelism and pipeline parallelism under the *heterogeneous* setting
  * Select a leader GPU node in a pipeline stage
    * Manage the peer-to-peer communication between pipeline stages
    * Manage the broadcast operation of the received activations within its tensor model parallel group

### Formulation

* $$\textbf{D} = { d\_1, \cdots, d\_N }$$ — a set of $$N$$ GPU devices
  * $$M\_d$$ — GPU memory limit
  * $$m\_d$$ — GPU memory bandwidth
  * $$c\_d$$ — Tensor core computation power
  * $$\textbf{A} \in \mathbb{R}\_{+}^{N \times N}$$ — The communication *delay* matrix between these devices
    * $$\alpha\_{d, d'}$$ — The delay between device $$d$$ and $$d'$$
  * $$\textbf{B} \in \mathbb{R}\_{+}^{N \times N}$$ — The communication *bandwidth* matrix between these devices
    * $$\beta\_{d, d'}$$ — The bandwidth between the device $$d$$ and $$d'$$
  * $$L$$ — The total number of layers in the model
* $$\sigma$$ — An assignment $$\textbf{D} \rightarrow { \textbf{d}*{i,j}, l*{i,j} }$$
  * $$\textbf{d}\_{i,j}$$ — A subset of GPU devices
    * Serve the $$i$$-th model replica as an independent pipeline
    * Serve the $$j$$-th stage in the -th pipeline
    * $$l\_{i,j}$$ — Transformer layers
    * $$\left | \textbf{d}\_{i,j} \right| > 1$$ → Run tensor model parallelism
  * An optimal assignment $$\sigma^{\*} = \text{arg},\mathop{\text{max}}\limits\_{\sigma \in \sum}\quad\mathbb{E}*{\textbf{T} \sim \varphi}\[\text{SLO}(C*{\text{comm}}(\sigma) + C\_{\text{comp}}(\sigma))]$$
    * s.t. $$C\_{\text{mem}}^{d}(\sigma) \le M\_d, \forall d \in \textbf{D}$$
    * $$C\_{\text{comm}}(\sigma)$$ — The communication cost
    * $$C\_{\text{comp}}(\sigma)$$ — The computation cost
    * $$C\_{\text{mem}}^{d}(\sigma)$$ — Memory consumption for the device $$d$$
    * Objective: Find an *optimal* assignment that *partitions the device set to represent multiple independent inference pipeline groups* that can *maximize the inference service SLO* considering the computation cost, communication cost, and memory consumption constraints

### Implementation

* Essential change: *each pipeline parallel stage can be assigned with a different number of layers and tensor model parallel degree*
* Steps
  * Each stage selects *a leader GPU* to initialize an independent tensor model parallel group
  * Only the leader node in each stage (i.e., tensor model parallel group) sends the activation to the leader GPU in the next stage
  * After receiving the activation, the leader GPU broadcasts this activation among its tensor model parallel group to execute the tensor model parallel computation

### Evaluation

* Compared to [Petals](https://aclanthology.org/2023.acl-demo.54)
  * Petals depends on *dynamic adjustment* of the collective learning paradigm to ensure elasticity → a dynamic design compromises the inference service performance
  * HexGen carefully designs *static* scheduling of the inference workflow
* Metrics
  * SLO attainment
    * Generate some inference workload according to *a Poisson process* parameterized by *request rate*
    * For a target SLO goal (e.g., 99%)
      * *The minimum latency deadline* to achieve the desired attainment
      * The system’s resilience to *peak request rate*
    * Llama 2 70B model
    * Real-world prompts: <https://huggingface.co/datasets/lmsys/chatbot\\_arena\\_conversations>
    * Output sequence length: 32, 64, 128
    * Request rates vary between 0.125 - 10 requests per second
    * $$\text{SLO Scale}=T$$ → the default SLO is set as tight as $$T \times$$ inference latency


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://paper.lingyunyang.com/reading-notes/miscellaneous/arxiv/2023/hexgen.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
