# SOSP 2024

## Meta Info

Homepage: <https://sigops.org/s/conferences/sosp/2024/>

### Paper List

* <https://dl.acm.org/doi/proceedings/10.1145/3694715>
* <https://sigops.org/s/conferences/sosp/2024/accepted.html>

### Acceptance Rate

17.3% (= 43 / 248)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695960)] \[[Slides](https://swapnilgandhi.com/slides/recycle-sosp24.pdf)] \[[arXiv](https://arxiv.org/abs/2405.14009)]
    * Stanford
    * Dynamically re-route the work of a failed server to its data-parallel peers, executing it within the bubbles of the original pipeline schedule.
  * Enabling Parallelism Hot Switching for Efficient Training of Large Language Models \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695969)] \[[Code](https://github.com/PKU-DAIR/Hetu)]
    * PKU
    * **HotSPa** — a system that hot-switches among multiple parallelism strategies to train efficiently on inputs with varying sequence lengths
      * Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy.
      * The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.
      * Deduces efficient many-to-many communication plans of parallelism hot switching.
  * Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695975)] \[[Code](https://github.com/kungfu-team/tenplex)] \[[arXiv](https://arxiv.org/abs/2312.05181)]
    * ICL & Aalto University & Edinburgh
    * **Tenplex** — a state management library.
      * Enable jobs to change the parallelism dynamically.
      * PTC: Parallelizable Tensor Collection
        * Dataset state
        * Model state
      * Execute PTC transformations in parallel with minimum data movement between workers.
  * Reducing Energy Bloat in Large Model Training \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695970)] \[[Code](https://github.com/ml-energy/zeus)] \[[arXiv](https://arxiv.org/abs/2312.06902)]
    * UMich
    * **Perseus**: use a graph cut-based algorithm to obtain the "iteration time-energy" Pareto frontier; schedule the energy consumption across time.
* LLM Inference
  * LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695948)] \[[Code](https://github.com/LoongServe/LoongServe)] \[[arXiv](https://arxiv.org/abs/2404.09526)]
    * PKU
    * **ESP**: Elastic Sequence Parallelism
    * Elastically adjust the degree of parallelism in real time.
    * Reduce key-value cache migration overhead; overlap partial decoding communication with computation.
    * Reduce key-value cache fragmentation across instances.
  * PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695964)] \[[Code](https://github.com/SJTU-IPADS/PowerInfer)] \[[arXiv](https://arxiv.org/abs/2312.12456)]
    * SJTU IPADS
    * GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.
    * Integrate adaptive predictors and neuron-aware sparse operators.
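
Perseus's idea of removing "energy bloat" can be illustrated with a toy greedy version (the real system solves a graph cut over the iteration DAG and plans per-computation, not per-stage; all names below are illustrative). Each pipeline stage has several (time, energy) operating points, e.g. from GPU frequency scaling; stages with slack pick a slower, cheaper point that still fits under the critical-path time.

```python
# Toy sketch of Perseus-style energy scheduling (illustrative, not the real
# graph cut-based algorithm): each stage picks the lowest-energy operating
# point whose time still fits within the iteration deadline.

def remove_energy_bloat(stage_options, deadline):
    """For each stage, pick the lowest-energy (time, energy) option
    with time <= deadline."""
    plan = []
    for options in stage_options:            # options: list of (time, energy)
        feasible = [o for o in options if o[0] <= deadline]
        plan.append(min(feasible, key=lambda o: o[1]))
    return plan

# Three stages; the critical path bounds the iteration at 10 ms.
stage_options = [
    [(10, 50), (8, 70)],    # stage 0 is critical: fast point wins on energy
    [(4, 30), (8, 18)],     # stage 1 has slack: slow down to save energy
    [(6, 40), (12, 25)],    # stage 2: the 12 ms point would miss the deadline
]
print(remove_energy_bloat(stage_options, deadline=10))
# → [(10, 50), (8, 18), (6, 40)]
```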
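
PowerInfer's hot/cold split can be sketched in a few lines (a minimal toy, not the real PowerInfer API; function names and the activation statistics are made up): neurons with high historical activation frequency are pinned to the GPU up to a memory budget, and a predicted-active set is computed on whichever device holds each neuron.

```python
# Toy sketch of PowerInfer-style hot/cold neuron partitioning and hybrid
# execution (illustrative names; the real system uses learned predictors
# and neuron-aware sparse GPU/CPU kernels).

def partition_neurons(activation_freq, gpu_budget):
    """Greedily pin the most frequently activated neurons to the GPU."""
    ranked = sorted(range(len(activation_freq)),
                    key=lambda i: activation_freq[i], reverse=True)
    return set(ranked[:gpu_budget]), set(ranked[gpu_budget:])

def hybrid_forward(x, weights, predicted_active, hot):
    """Compute only predicted-active neurons, on their assigned device."""
    out = {}
    for i in predicted_active:
        device = "gpu" if i in hot else "cpu"
        # dot product of the input with neuron i's weight row
        out[i] = (device, sum(a * b for a, b in zip(x, weights[i])))
    return out

freq = [0.9, 0.05, 0.7, 0.01]            # per-neuron activation frequency
hot, cold = partition_neurons(freq, gpu_budget=2)
print(sorted(hot))                        # → [0, 2]

weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
print(hybrid_forward([2.0, 3.0], weights, predicted_active=[0, 1], hot=hot))
# → {0: ('gpu', 2.0), 1: ('cpu', 3.0)}
```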

### Model Serving

* Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation \[[Paper](https://dl.acm.org/doi/pdf/10.1145/3694715.3695978)]
  * GaTech & Princeton & Stanford
  * $$E^3$$ (**E**fficient **E**arly-**E**xits): trade off accuracy and resource costs at a per-input granularity via early-exit models.
  * Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.
* Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695963)] \[[Code](https://github.com/dywsjtu/apparate)] \[[arXiv](https://arxiv.org/abs/2312.05385)]
  * Princeton & GaTech
  * Automatically apply and manage early exits (certain inputs can exit with results at intermediate layers) in ML models.
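
Both entries above build on early exits; the core inference loop can be sketched as follows (a minimal toy with hypothetical helper names — real systems like Apparate attach learned exit heads and tune their confidence thresholds at runtime).

```python
# Minimal early-exit inference loop: run layers in order and return at the
# first exit head whose confidence clears the threshold.

def early_exit_infer(x, layers, exit_heads, threshold):
    """Return (label, depth) for the first confident exit; depth counts
    how many layers the input traversed."""
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        h = layer(h)
        label, confidence = head(h)
        if confidence >= threshold:
            return label, depth           # early exit: skip remaining layers
    return label, depth                   # fell through to the final layer

# Toy stand-ins: each "layer" doubles the value; each "head" grows more
# confident as the value grows.
layers = [lambda h: h * 2] * 4
exit_heads = [lambda h: ("pos" if h > 0 else "neg", min(1.0, abs(h) / 8))] * 4
print(early_exit_infer(1, layers, exit_heads, threshold=0.9))  # → ('pos', 3)
```

Easy inputs leave at shallow exits (cutting latency), while hard inputs still traverse the full model — the latency-throughput tension both papers target.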

### ML Compilation

* Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695955)] \[[arXiv](https://arxiv.org/abs/2408.04808)]
  * UIUC & MSRA
  * **T10**, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips (e.g., the Graphcore IPU).
* SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695958)]
  * Indian Institute of Science & MSR
  * Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.
  * Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing retargetable compiler that can generate code for any specified schedule.
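
The kind of code a tree compiler emits can be illustrated with a flattened array-of-nodes layout (a toy sketch under assumed field names, not SilvanForge's actual output format): internal nodes hold a feature index and threshold, leaves hold the prediction, and traversal becomes a tight index-chasing loop.

```python
# Sketch of decision-tree inference over a flattened node array.
# Each node: (feature_index, threshold, left_child, right_child, leaf_value);
# a negative feature_index marks a leaf.

def predict(nodes, x):
    """Walk the flattened tree from the root until a leaf is reached."""
    i = 0
    while True:
        feat, thresh, left, right, value = nodes[i]
        if feat < 0:                      # leaf node
            return value
        i = left if x[feat] <= thresh else right

# A tiny tree: the root splits on feature 0 at 0.5.
nodes = [
    ( 0, 0.5, 1, 2, None),  # internal node
    (-1, 0.0, 0, 0, "A"),   # leaf: x[0] <= 0.5
    (-1, 0.0, 0, 0, "B"),   # leaf: x[0] >  0.5
]
print(predict(nodes, [0.3]))  # → A
print(predict(nodes, [0.9]))  # → B
```

A schedule in SilvanForge's sense would then decide how such loops are batched, vectorized, and laid out for a given CPU or GPU target.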

### Serverless Computing

* Dirigent: Lightweight Serverless Orchestration \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695966)] \[[Code](https://github.com/eth-easl/dirigent)] \[[arXiv](https://arxiv.org/abs/2404.16393)]
  * ETH
  * Simplify state management relative to existing orchestration systems (e.g., Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.
