# SOSP 2024

## Meta Info

Homepage: <https://sigops.org/s/conferences/sosp/2024/>

### Paper List

* <https://dl.acm.org/doi/proceedings/10.1145/3694715>
* <https://sigops.org/s/conferences/sosp/2024/accepted.html>

### Acceptance Rate

17.3% (= 43 / 248)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695960)] \[[Slides](https://swapnilgandhi.com/slides/recycle-sosp24.pdf)] \[[arXiv](https://arxiv.org/abs/2405.14009)]
    * Stanford
    * Dynamically re-route the work of a failed server to data-parallel peers; execute within bubbles of the original pipeline schedule.
  * Enabling Parallelism Hot Switching for Efficient Training of Large Language Models \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695969)] \[[Code](https://github.com/PKU-DAIR/Hetu)]
    * PKU
    * **HotSPa** — a system that adopts multiple parallelism strategies for efficient training with sequence inputs
      * Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy.
      * The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.
      * Deduce efficient many-to-many communication plans for parallelism hot switching.
  * Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695975)] \[[Code](https://github.com/kungfu-team/tenplex)] \[[arXiv](https://arxiv.org/abs/2312.05181)]
    * ICL & Aalto University & Edinburgh
    * **Tenplex** — a state management library.
      * Enable jobs to change the parallelism dynamically.
      * PTC: Parallelizable Tensor Collection
        * Dataset state
        * Model state
      * Execute PTC transformations in parallel with minimum data movement between workers.
  * Reducing Energy Bloat in Large Model Training \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695970)] \[[Code](https://github.com/ml-energy/zeus)] \[[arXiv](https://arxiv.org/abs/2312.06902)]
    * UMich
    * **Perseus**: use a graph cut-based algorithm to obtain the "iteration time-energy" Pareto frontier; schedule the energy consumption across time.
* LLM Inference
  * LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695948)] \[[Code](https://github.com/LoongServe/LoongServe)] \[[arXiv](https://arxiv.org/abs/2404.09526)]
    * PKU
    * **ESP**: Elastic Sequence Parallelism
    * Elastically adjust the degree of parallelism in real-time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances.
  * PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695964)] \[[Code](https://github.com/SJTU-IPADS/PowerInfer)] \[[arXiv](https://arxiv.org/abs/2312.12456)]
    * SJTU IPADS
    * GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.
    * Integrate adaptive predictors and neuron-aware sparse operators.
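
The hot/cold neuron split described for PowerInfer can be illustrated with a minimal sketch. It assumes a per-neuron activation frequency from some profiling run and a fixed GPU budget counted in neurons; the function name `partition_neurons`, its parameters, and the greedy top-k rule are hypothetical simplifications for illustration, not the placement policy used by the paper.

```python
def partition_neurons(activation_freq, gpu_capacity):
    """Split neuron indices into (hot, cold) lists.

    activation_freq: per-neuron activation frequency (hypothetical profiling output).
    gpu_capacity: number of neurons that fit on the GPU (hypothetical budget).
    """
    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(
        range(len(activation_freq)),
        key=lambda i: activation_freq[i],
        reverse=True,
    )
    # The most frequently activated neurons go to the GPU; the rest stay on the CPU.
    hot = sorted(ranked[:gpu_capacity])
    cold = sorted(ranked[gpu_capacity:])
    return hot, cold


freqs = [0.90, 0.05, 0.70, 0.10, 0.02, 0.60]
hot, cold = partition_neurons(freqs, gpu_capacity=3)
print("GPU (hot):", hot)
print("CPU (cold):", cold)
```

With the toy frequencies above, neurons 0, 2, and 5 land on the GPU and the remaining three are left to the CPU.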

### Model Serving

* Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation \[[Paper](https://dl.acm.org/doi/pdf/10.1145/3694715.3695978)]
  * GaTech & Princeton & Stanford
  * $$E^3$$ (**E**fficient **E**arly-**E**xits): trade off accuracy and resource costs at per-input granularity using early-exit models.
  * Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.
* Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695963)] \[[Code](https://github.com/dywsjtu/apparate)] \[[arXiv](https://arxiv.org/abs/2312.05385)]
  * Princeton & GaTech
  * Automatically apply and manage early exits (certain inputs can exit with results at intermediate layers) in ML models.
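
The basic early-exit mechanism that both papers above build on can be sketched as follows. This is a toy illustration only: `run_with_early_exit`, the `ramps` mapping, and the confidence threshold are hypothetical names, and the sketch omits everything the systems add on top, such as batch management and automatic ramp placement.

```python
def run_with_early_exit(x, blocks, ramps, final_head, threshold):
    """Run blocks in order and return (label, index_of_last_block_run).

    blocks: list of callables, each transforming the hidden state.
    ramps: dict mapping a block index to an exit ramp; a ramp returns
           (label, confidence) for the hidden state after that block.
    final_head: callable producing a label if no ramp fires.
    threshold: minimum confidence needed to exit early.
    """
    h = x
    for i, block in enumerate(blocks):
        h = block(h)
        ramp = ramps.get(i)
        if ramp is not None:
            label, confidence = ramp(h)
            if confidence >= threshold:
                # Confident enough: skip the remaining blocks.
                return label, i
    return final_head(h), len(blocks) - 1


def sign_label(h):
    return "pos" if h >= 0 else "neg"


def toy_ramp(h):
    # Toy confidence: larger magnitude means more confident.
    return sign_label(h), min(1.0, abs(h) / 10.0)


blocks = [lambda h: h * 2, lambda h: h * 2, lambda h: h * 2]
ramps = {0: toy_ramp, 1: toy_ramp}

easy = run_with_early_exit(5, blocks, ramps, sign_label, threshold=0.8)
hard = run_with_early_exit(1, blocks, ramps, sign_label, threshold=0.8)
print(easy, hard)
```

In this toy setup the "easy" input exits after the first block, while the "hard" input never reaches the confidence threshold and runs all three blocks.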

### ML Compilation

* Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695955)] \[[arXiv](https://arxiv.org/abs/2408.04808)]
  * UIUC & MSRA
  * **T10**, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips (e.g., Graphcore IPU).
* SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695958)]
  * Indian Institute of Science & MSR
  * Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.
  * Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing retargetable compiler that can generate code for any specified schedule.

### Serverless Computing

* Dirigent: Lightweight Serverless Orchestration \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695966)] \[[Code](https://github.com/eth-easl/dirigent)] \[[arXiv](https://arxiv.org/abs/2404.16393)]
  * ETH
  * Simplify state management of the existing orchestration system (Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.

