SOSP 2024
Meta Info
Homepage: https://sigops.org/s/conferences/sosp/2024/
Paper List
Papers
Large Language Models (LLMs)
LLM Training
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models [Paper] [Code]
PKU
HotSPa — a system that adopts multiple parallelism strategies for efficient training on variable-length sequence inputs
Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy (see the sketch after this list).
The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.
Deduce efficient many-to-many communication plans for parallelism hot switching.
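A minimal sketch of the per-group dispatch idea, assuming hypothetical length buckets and strategy names; HotSPa's actual strategies and graph-switching machinery are much richer.

```python
from collections import defaultdict

# Hypothetical mapping from sequence-length buckets to parallelism strategies.
STRATEGY_BY_BUCKET = {
    "short":  "data_parallel",      # many small samples -> large DP degree
    "medium": "tensor_parallel",    # balance memory and throughput
    "long":   "sequence_parallel",  # long contexts need activation sharding
}

def bucket_of(seq_len: int) -> str:
    if seq_len <= 1024:
        return "short"
    if seq_len <= 8192:
        return "medium"
    return "long"

def group_minibatch(seq_lens):
    """Classify samples into groups, one per parallelism strategy."""
    groups = defaultdict(list)
    for idx, n in enumerate(seq_lens):
        groups[STRATEGY_BY_BUCKET[bucket_of(n)]].append(idx)
    return dict(groups)

if __name__ == "__main__":
    lens = [512, 300, 2048, 16384, 900, 12000]
    for strategy, sample_ids in group_minibatch(lens).items():
        # In HotSPa, each group would be trained after a hot switch to its
        # strategy; here we only print the dispatch decision.
        print(f"{strategy}: samples {sample_ids}")
```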
Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections [Paper] [Code] [arXiv]
ICL & Aalto University & Edinburgh
Tenplex — a state management library.
Enable deep-learning jobs to change their parallelism dynamically during training (see the re-partitioning sketch after this list).
PTC: Parallelizable Tensor Collection
Dataset state
Model state
Execute PTC transformations in parallel with minimum data movement between workers.
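To make the idea concrete, here is a minimal sketch of a re-partitioning plan for a single tensor that is evenly row-split across workers; the function names and the contiguous split are assumptions, not Tenplex's actual PTC API. Slices that overlap between the old and new layouts stay local, which is where the minimal data movement comes from.

```python
def row_ranges(n_rows: int, n_workers: int):
    """Evenly split [0, n_rows) into contiguous per-worker row ranges."""
    base, rem = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

def replan(n_rows: int, old_workers: int, new_workers: int):
    """For each new worker, list the (old_worker, row_range) slices it needs.

    Ranges already held by the same worker would stay local; only
    non-overlapping slices must move across the network.
    """
    old = row_ranges(n_rows, old_workers)
    plan = {}
    for w, (ns, ne) in enumerate(row_ranges(n_rows, new_workers)):
        plan[w] = [(ow, (max(ns, os_), min(ne, oe)))
                   for ow, (os_, oe) in enumerate(old)
                   if max(ns, os_) < min(ne, oe)]
    return plan

if __name__ == "__main__":
    # Scale a 12-row tensor from 2-way to 3-way parallelism.
    for worker, fetches in replan(12, 2, 3).items():
        print(f"new worker {worker} fetches {fetches}")
```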
LLM Inference
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [Paper] [Code] [arXiv]
PKU
ESP: Elastic Sequence Parallelism
Elastically adjust the degree of parallelism in real time.
Reduce key-value cache migration overhead and overlap partial decoding communication with computation.
Reduce key-value cache fragmentation across instances.
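A minimal sketch of what an elastic degree-of-parallelism decision could look like; the thresholds and the proportional prefill rule are assumptions, not LoongServe's actual scheduling policy.

```python
def esp_degree(phase: str, context_len: int, free_instances: int) -> int:
    """Pick how many instances to shard a request's sequence across."""
    if phase == "decode":
        # Decoding emits one token at a time, so a small degree avoids
        # paying communication cost for little parallel work.
        return 1
    # Prefill is compute-bound in context length: scale the degree up,
    # capped by the instances currently free in the cluster.
    wanted = max(1, context_len // 32_768)  # assumed tokens-per-instance budget
    return min(wanted, free_instances)

if __name__ == "__main__":
    for phase, ctx in [("prefill", 262_144), ("prefill", 8_192), ("decode", 262_144)]:
        print(phase, ctx, "->", esp_degree(phase, ctx, free_instances=8), "instance(s)")
```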
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [Paper] [Code] [arXiv]
SJTU IPADS
GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.
Integrate adaptive predictors and neuron-aware sparse operators.
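A minimal NumPy sketch of the hot/cold split on one FFN layer; the offline hot fraction, the online activation predictor, and the GPU/CPU labels are all assumptions standing in for PowerInfer's profiling, adaptive predictors, and hybrid execution.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 256
W = rng.standard_normal((d_out, d_in))

# Offline profiling (assumed): the most frequently activated 20% of neurons
# are "hot" and would be preloaded into GPU memory; the rest stay on the CPU.
activation_freq = rng.random(d_out)
hot = activation_freq > np.quantile(activation_freq, 0.8)

def forward(x):
    y = np.zeros(d_out)
    # Hot neurons: dense compute on the (simulated) GPU partition.
    y[hot] = W[hot] @ x
    # Cold neurons: an online predictor (assumed perfect here) selects the
    # few that will actually fire, so the CPU only computes those rows.
    predicted_active = ~hot & (rng.random(d_out) < 0.1)
    y[predicted_active] = W[predicted_active] @ x
    return np.maximum(y, 0.0)  # ReLU, the source of activation sparsity

if __name__ == "__main__":
    print(forward(rng.standard_normal(d_in))[:8])
```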
Model Serving
Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation [Paper]
GaTech & Princeton & Stanford
Efficient early-exits: trade off accuracy and resource costs at per-input granularity using early-exit models.
Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.
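A minimal sketch of the constant-batch-size idea: when confident inputs leave at an exit head, queued inputs backfill the batch before the next block of layers runs. The confidence function, threshold, and block count are assumptions.

```python
from collections import deque

NUM_BLOCKS, BATCH = 3, 4

def confidence(sample, block):
    """Stand-in for an exit head's confidence (e.g., softmax margin)."""
    return 0.3 * (block + 1) + 0.1 * (sample % 3)

def serve(inputs):
    queue = deque(inputs)
    batch = [(queue.popleft(), 0) for _ in range(min(BATCH, len(queue)))]
    finished = []
    while batch:
        nxt = []
        for sample, block in batch:
            if confidence(sample, block) >= 0.8 or block + 1 == NUM_BLOCKS:
                finished.append((sample, block))  # exits at this block
            else:
                nxt.append((sample, block + 1))   # continues to next block
        # Backfill with waiting inputs so every block sees a full batch.
        while queue and len(nxt) < BATCH:
            nxt.append((queue.popleft(), 0))
        batch = nxt
    return finished

if __name__ == "__main__":
    print(serve(list(range(10))))  # (sample, exit block) pairs
```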
ML Compilation
SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference [Paper]
Indian Institute of Science & MSR
Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.
Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing retargetable compiler that can generate code for any specified schedule.
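A minimal sketch of schedule-guided tree inference: a tiny schedule object picks the batch tile size and loop order, and an interpreter follows it. The knobs are illustrative, not SilvanForge's scheduling language, and a real compiler would generate specialized CPU/GPU code for the chosen schedule rather than interpret it.

```python
from dataclasses import dataclass

@dataclass
class Schedule:
    row_tile: int = 4          # rows processed per tile (cache blocking)
    trees_outer: bool = True   # loop over trees outside rows, or vice versa

def eval_tree(tree, x):
    """tree: nested (feature, threshold, left, right) tuples; leaves are floats."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def predict(forest, rows, sched: Schedule):
    out = [0.0] * len(rows)
    for start in range(0, len(rows), sched.row_tile):
        tile = range(start, min(start + sched.row_tile, len(rows)))
        if sched.trees_outer:
            for tree in forest:
                for i in tile:
                    out[i] += eval_tree(tree, rows[i])
        else:
            for i in tile:
                for tree in forest:
                    out[i] += eval_tree(tree, rows[i])
    return out

if __name__ == "__main__":
    forest = [(0, 0.5, -1.0, (1, 0.2, 0.0, 1.0))]
    print(predict(forest, [[0.1, 0.9], [0.7, 0.1]], Schedule(row_tile=2)))
```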
Serverless Computing