SOSP 2024

Meta Info

Homepage: https://sigops.org/s/conferences/sosp/2024/

Paper List

Papers

Large Language Models (LLMs)

  • LLM Training

    • ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation [Paper] [Slides] [arXiv]

      • Stanford

      • Dynamically re-route the work of a failed server to data-parallel peers; execute within bubbles of the original pipeline schedule.

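      A minimal sketch of the rerouting idea, assuming a toy schedule representation (per-worker lists of (timeslot, microbatch) pairs, with None marking a pipeline bubble); the paper's actual scheduler is considerably richer:

      ```python
      # Toy model of ReCycle-style rerouting (structure and names are my own
      # illustration, not the paper's actual schedule representation).

      def reroute(schedule, failed, peers):
          """Move a failed worker's microbatches into its data-parallel
          peers' pipeline bubbles (slots holding None)."""
          orphaned = [mb for _, mb in schedule.pop(failed) if mb is not None]
          for peer in peers:
              slots = schedule[peer]
              for i, (t, mb) in enumerate(slots):
                  if mb is None and orphaned:      # fill an idle bubble
                      slots[i] = (t, orphaned.pop(0))
          return schedule, orphaned                # leftovers would extend the step

      # Two data-parallel replicas of pipeline stage 0; "s0_dp0" fails.
      sched = {
          "s0_dp0": [(0, "mb0"), (1, "mb2"), (2, None)],
          "s0_dp1": [(0, "mb1"), (1, "mb3"), (2, None), (3, None)],
      }
      print(reroute(sched, failed="s0_dp0", peers=["s0_dp1"]))
      ```
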
    • Enabling Parallelism Hot Switching for Efficient Training of Large Language Models [Paper] [Code]

      • PKU

      • HotSPa — a system that adopts multiple parallelism strategies for efficient training over variable-length sequence inputs

        • Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy.

        • The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.

        • Deduces efficient many-to-many communication plans for parallelism hot switching.

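      A hedged sketch of the grouping step, with made-up length thresholds and strategy tuples; `hot_switch` and `run_group` are hypothetical placeholders for the graph switch and per-group training:

      ```python
      # (max_seq_len, strategy) pairs: shorter sequences favor data
      # parallelism, longer ones shift toward tensor/pipeline parallelism.
      # All numbers below are invented for illustration.
      STRATEGIES = [
          (2_048,   {"dp": 8, "tp": 1, "pp": 1}),
          (16_384,  {"dp": 2, "tp": 4, "pp": 1}),
          (131_072, {"dp": 1, "tp": 4, "pp": 2}),
      ]

      def group_minibatch(sample_lengths):
          """Assign each sample to the first strategy whose cap fits it."""
          groups = {gid: [] for gid in range(len(STRATEGIES))}
          for idx, length in enumerate(sample_lengths):
              for gid, (cap, _) in enumerate(STRATEGIES):
                  if length <= cap:
                      groups[gid].append(idx)
                      break
          return groups

      def hot_switch(strategy):
          # Placeholder: the real system re-shards model states via planned
          # many-to-many communication before running under `strategy`.
          print("switch to", strategy)

      def run_group(members):
          print("train on samples", members)

      def train_step(sample_lengths):
          for gid, members in group_minibatch(sample_lengths).items():
              if members:
                  hot_switch(STRATEGIES[gid][1])
                  run_group(members)

      train_step([512, 900, 40_000, 3_000])
      ```
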
    • Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections [Paper] [Code] [arXiv]

      • ICL & Aalto University & Edinburgh

      • Tenplex — a state management library.

        • Enable jobs to change the parallelism dynamically.

        • PTC: Parallelizable Tensor Collection

          • Dataset state

          • Model state

        • Execute PTC transformations in parallel with minimum data movement between workers.

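      A rough NumPy sketch of PTC-style re-partitioning; the function names are assumptions, and a real implementation would move only the shards that change owner instead of materializing the full tensor:

      ```python
      import numpy as np

      def shard(tensor, parts, axis=0):
          return np.array_split(tensor, parts, axis=axis)

      def repartition(shards, new_parts, axis=0):
          """Re-slice a sharded tensor from len(shards)-way to new_parts-way."""
          full = np.concatenate(shards, axis=axis)  # logical view of the PTC entry
          return shard(full, new_parts, axis=axis)

      weights = np.arange(16.0).reshape(8, 2)
      two_way = shard(weights, 2)               # job running on 2 workers
      four_way = repartition(two_way, 4)        # scale out: now 4 workers
      assert np.allclose(np.concatenate(four_way), weights)
      print([s.shape for s in four_way])        # [(2, 2), (2, 2), (2, 2), (2, 2)]
      ```
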
    • Reducing Energy Bloat in Large Model Training [Paper] [Code] [arXiv]

      • UMich

      • Perseus: use a graph cut-based algorithm to obtain the "iteration time-energy" Pareto frontier; schedule the energy consumption across time.

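      An illustrative brute-force version of the planning step: keep the non-dominated (time, energy) plans and pick the cheapest one that still meets the iteration deadline. Perseus derives the frontier with its graph cut-based algorithm; this filter only shows what the frontier means:

      ```python
      def pareto_frontier(plans):
          """plans: list of (time_s, energy_j). Keep non-dominated points."""
          frontier = []
          for t, e in sorted(plans):            # sort by time, then energy
              if not frontier or e < frontier[-1][1]:
                  frontier.append((t, e))
              # else: dominated by an equal-or-faster, cheaper plan
          return frontier

      def choose(frontier, deadline_s):
          feasible = [(t, e) for t, e in frontier if t <= deadline_s]
          return min(feasible, key=lambda p: p[1]) if feasible else None

      plans = [(1.0, 950.0), (1.1, 800.0), (1.1, 820.0), (1.3, 700.0), (1.2, 760.0)]
      front = pareto_frontier(plans)
      print(front, choose(front, deadline_s=1.25))
      ```
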
  • LLM Inference

    • LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [Paper] [Code] [arXiv]

      • PKU

      • ESP: Elastic Sequence Parallelism

      • Elastically adjust the degree of parallelism in real-time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances.

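      A toy illustration of elastic scaling across request phases, with invented thresholds: long-context prefill gets more sequence-parallel instances, and decoding scales down. The real policy also plans KV cache migration and fragmentation costs:

      ```python
      def prefill_dop(context_len, max_instances=8):
          # More tokens -> more sequence-parallel instances (capped).
          return min(max_instances, max(1, context_len // 32_000 + 1))

      def decode_dop(prefill_instances):
          # Scale down after prefill: decoding needs less compute per step.
          return max(1, prefill_instances // 2)

      for ctx in (4_000, 100_000, 500_000):
          p = prefill_dop(ctx)
          print(f"ctx={ctx}: prefill on {p} instance(s), decode on {decode_dop(p)}")
      ```
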
    • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [Paper] [Code] [arXiv]

      • SJTU IPADS

      • GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.

      • Integrate adaptive predictors and neuron-aware sparse operators.

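      A minimal NumPy sketch of the hot/cold split for one FFN layer, simulating both devices on the CPU; the 20% hot ratio and the random "predictor" are stand-ins for the paper's profiled neuron statistics and trained activation predictors:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      d, n = 64, 256                        # hidden dim, number of FFN neurons
      W = rng.standard_normal((n, d))

      # Offline profiling: the most frequently activated 20% of neurons are
      # "hot" (preloaded into GPU memory); the rest stay on the CPU side.
      freq = rng.random(n)                  # stand-in activation frequencies
      hot = np.sort(np.argsort(freq)[-n // 5:])
      cold = np.setdiff1d(np.arange(n), hot)

      def forward(x, predicted_active):
          """Compute only neurons the predictor marks active, on their device."""
          out = np.zeros(n)
          act_hot = np.intersect1d(hot, predicted_active)
          act_cold = np.intersect1d(cold, predicted_active)
          out[act_hot] = W[act_hot] @ x     # hot rows: would run on the GPU
          out[act_cold] = W[act_cold] @ x   # cold rows: computed on the CPU
          return np.maximum(out, 0.0)       # ReLU preserves activation sparsity

      x = rng.standard_normal(d)
      active = rng.choice(n, size=n // 4, replace=False)
      y = forward(x, active)
      print(f"{np.count_nonzero(y)} of {n} neurons produced nonzero output")
      ```
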
Model Serving

  • Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation [Paper]

    • GaTech & Princeton & Stanford

    • E^3 (Efficient Early-Exits): trade off accuracy and resource costs at a per-input granularity via early-exit models.

    • Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.

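    A small sketch of the constant-batch idea, assuming per-block input queues and a random stand-in for the exit check: each block batches whichever inputs reached it, so exits upstream do not shrink downstream batches (a real server would also pad or wait on tail batches):

    ```python
    import random
    from collections import deque

    random.seed(0)
    NUM_BLOCKS, BATCH = 3, 4
    queues = [deque() for _ in range(NUM_BLOCKS)]  # inputs waiting per block
    queues[0].extend(range(12))

    def exits_early(x, block):            # stand-in for an exit head's check
        return random.random() > 0.7

    done = []
    while any(queues):
        for b in range(NUM_BLOCKS):
            q = queues[b]
            if not q:
                continue
            # Each replicated block batches the inputs that reached it, so
            # the batch size stays constant even as other inputs exit early.
            batch = [q.popleft() for _ in range(min(BATCH, len(q)))]
            for x in batch:
                if b == NUM_BLOCKS - 1 or exits_early(x, b):
                    done.append((x, b))   # result produced at block b
                else:
                    queues[b + 1].append(x)
    print(sorted(done))
    ```
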
  • Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [Paper] [Code] [arXiv]

    • Princeton & GaTech

    • Automatically apply and manage early exits (certain inputs can exit with results at intermediate layers) in ML models.

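    A hedged sketch of the runtime adaptation loop: exit "ramps" carry confidence thresholds that are re-tuned as accuracy is observed. The proportional update rule below is invented for illustration, not Apparate's actual policy:

    ```python
    class Ramp:
        """An exit head attached at an intermediate layer."""
        def __init__(self, layer, threshold=0.9):
            self.layer, self.threshold = layer, threshold

    def adapt(ramp, observed_accuracy, target_accuracy=0.99, step=0.01):
        """Raise the threshold if exits hurt accuracy; lower it given slack."""
        if observed_accuracy < target_accuracy:
            ramp.threshold = min(1.0, ramp.threshold + step)  # exit less often
        else:
            ramp.threshold = max(0.5, ramp.threshold - step)  # exit more often
        return ramp.threshold

    ramp = Ramp(layer=6)
    for acc in (0.995, 0.992, 0.988, 0.991):  # accuracy measured per window
        print(adapt(ramp, acc))
    ```
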
ML Compilation

  • Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor [Paper] [arXiv]

    • UIUC & MSRA

    • T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips (e.g., the Graphcore IPU).

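    A pure-NumPy toy of the compute-shift pattern this hardware enables: each core keeps a weight row-block resident, computes with the activation shard it currently holds, then shifts shards to a neighbor over the inter-core links instead of going through shared memory. Everything here is a simulation, not T10's compiler output:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    P, k = 4, 3                           # cores, shard width
    W = rng.standard_normal((P * k, P * k))
    x = rng.standard_normal(P * k)

    # Each core keeps one row-block of W resident in local memory and starts
    # out holding one shard of x.
    w_rows = [W[c * k:(c + 1) * k] for c in range(P)]
    held = [x[c * k:(c + 1) * k] for c in range(P)]
    ids = list(range(P))                  # which x shard each core holds
    y = [np.zeros(k) for _ in range(P)]

    for _ in range(P):
        for c in range(P):                # compute with the locally held shard
            j = ids[c]
            y[c] += w_rows[c][:, j * k:(j + 1) * k] @ held[c]
        held = held[-1:] + held[:-1]      # shift shards over inter-core links
        ids = ids[-1:] + ids[:-1]

    assert np.allclose(np.concatenate(y), W @ x)
    print("distributed result matches W @ x")
    ```
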
  • SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference [Paper]

    • Indian Institute of Science & MSR

    • Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.

    • Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing retargetable compiler that can generate code for any specified schedule.

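    A loose sketch of what a schedule can express, with invented fields: the same forest traversal runs under different loop orders and row tiles, and the compiler would lower whichever point in the space the schedule picks. SilvanForge's actual scheduling language is richer than this:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Schedule:
        loop_order: str = "rows_outer"    # or "trees_outer"
        row_tile: int = 4                 # rows processed together

    def eval_tree(tree, x):
        node = 0
        while True:
            feat, thresh, left, right, leaf = tree[node]
            if leaf is not None:
                return leaf
            node = left if x[feat] <= thresh else right

    def predict(forest, rows, sched):
        out = [0.0] * len(rows)
        if sched.loop_order == "trees_outer":
            for tree in forest:
                for i, x in enumerate(rows):
                    out[i] += eval_tree(tree, x)
        else:  # rows_outer, tiled: a different locality trade-off
            for base in range(0, len(rows), sched.row_tile):
                for i in range(base, min(base + sched.row_tile, len(rows))):
                    out[i] = sum(eval_tree(t, rows[i]) for t in forest)
        return out

    # Node layout: (feature, threshold, left, right, leaf_value or None)
    tree = [(0, 0.5, 1, 2, None), (None, None, None, None, -1.0),
            (None, None, None, None, +1.0)]
    print(predict([tree, tree], [[0.2], [0.9]], Schedule()))
    ```
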
Serverless Computing

  • Dirigent: Lightweight Serverless Orchestration [Paper] [Code] [arXiv]

    • ETH

    • Simplify cluster state management relative to existing orchestration systems (e.g., Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.

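    A minimal sketch of the "no persistent state updates" idea, with invented classes and a toy rebuild protocol: placement state lives only in memory and is reconstructed from node reports after a control-plane restart, rather than being written to a datastore on every update:

    ```python
    class ControlPlane:
        def __init__(self):
            self.instances = {}          # function -> set of node ids (in memory)

        def place(self, function, node):
            self.instances.setdefault(function, set()).add(node)
            # Note: no write to a persistent store here; this state is
            # reconstructible from the nodes themselves.

        def recover(self, node_reports):
            """After a restart, rebuild state from per-node reports."""
            self.instances = {}
            for node, funcs in node_reports.items():
                for f in funcs:
                    self.place(f, node)

    cp = ControlPlane()
    cp.place("resize-image", "node-1")
    cp.place("resize-image", "node-2")

    cp2 = ControlPlane()                 # restarted control plane
    cp2.recover({"node-1": ["resize-image"], "node-2": ["resize-image"]})
    assert cp2.instances == cp.instances
    ```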