SOSP 2024
Homepage:
Acceptance rate: 17.3% (= 43 / 248)
LLM Training
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Stanford
Dynamically re-route the work of a failed server to its data-parallel peers, and execute the re-routed work within bubbles of the original pipeline schedule.
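A minimal sketch of the re-routing step (illustrative only; the function and data layout below are invented, not ReCycle's API or scheduler):

```python
# Hypothetical illustration of ReCycle-style failure re-routing (not the paper's code).
from collections import defaultdict

def reroute(microbatches, stage, failed_replica, num_replicas):
    """Reassign the failed replica's micro-batches to its data-parallel peers.

    microbatches: dict mapping (stage, replica) -> list of micro-batch ids.
    Returns a new assignment; surviving peers absorb the extra work, which
    ReCycle would then place into bubbles of the original pipeline schedule.
    """
    new_assignment = defaultdict(list, {k: list(v) for k, v in microbatches.items()})
    orphaned = new_assignment.pop((stage, failed_replica), [])
    survivors = [r for r in range(num_replicas) if r != failed_replica]
    for i, mb in enumerate(orphaned):
        peer = survivors[i % len(survivors)]          # round-robin over surviving peers
        new_assignment[(stage, peer)].append(mb)
    return new_assignment

# Example: stage 2's replica 1 fails; its micro-batches are split among replicas 0, 2, 3.
assignment = {(2, r): list(range(4 * r, 4 * r + 4)) for r in range(4)}
print(reroute(assignment, stage=2, failed_replica=1, num_replicas=4)[(2, 0)])
```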
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
PKU
HotSPa: a system that adopts multiple parallelism strategies for efficient training on variable-length sequence inputs.
Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy.
The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.
Deduces efficient many-to-many communication plans for parallelism hot switching.
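A rough sketch of the grouping step (the bucket boundaries and strategy table below are invented, not HotSPa's configuration):

```python
# Illustrative only: bucket samples by sequence length and pick a strategy per bucket.
def plan_hot_switches(seq_lens, buckets=((0, 2048), (2048, 8192), (8192, 1 << 30))):
    strategies = {
        (0, 2048): dict(dp=8, tp=1, pp=1),       # short sequences: pure data parallel
        (2048, 8192): dict(dp=2, tp=4, pp=1),    # medium: add tensor parallelism
        (8192, 1 << 30): dict(dp=1, tp=4, pp=2), # long: tensor + pipeline parallelism
    }
    groups = {b: [] for b in buckets}
    for idx, length in enumerate(seq_lens):
        for lo, hi in buckets:
            if lo <= length < hi:
                groups[(lo, hi)].append(idx)
                break
    # Each non-empty group is trained after a hot switch to its strategy;
    # HotSPa's graph compiler and switch planner handle the state transfer.
    return [(strategies[b], idxs) for b, idxs in groups.items() if idxs]

print(plan_hot_switches([512, 4096, 300, 16000, 700]))
```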
Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
ICL & Aalto University & Edinburgh
Tenplex — a state management library.
Enable jobs to change their degree of parallelism dynamically at runtime.
PTC: Parallelizable Tensor Collection
Dataset state
Model state
Execute PTC transformations in parallel with minimum data movement between workers.
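A toy re-shard plan for a single 1-D tensor (illustrative only; real PTCs cover the full model and dataset state):

```python
# Toy version of a PTC-style re-shard plan for one 1-D tensor (not Tenplex's API).
def shard_ranges(total, num_shards):
    step = total // num_shards
    return [(i * step, total if i == num_shards - 1 else (i + 1) * step)
            for i in range(num_shards)]

def reshard_plan(total, old_workers, new_workers):
    """Return (src, dst, lo, hi) transfers; ranges that stay on the same worker move nothing."""
    old = shard_ranges(total, old_workers)
    new = shard_ranges(total, new_workers)
    plan = []
    for dst, (nlo, nhi) in enumerate(new):
        for src, (olo, ohi) in enumerate(old):
            lo, hi = max(nlo, olo), min(nhi, ohi)
            if lo < hi:                      # this slice of the new shard currently lives on `src`
                plan.append((src, dst, lo, hi))
    return plan

# Scale a 1024-wide parameter from 2-way to 4-way parallelism.
for src, dst, lo, hi in reshard_plan(1024, 2, 4):
    note = "local" if src == dst else "move"
    print(f"worker {src} -> worker {dst}: [{lo}, {hi})  ({note})")
```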
Reducing Energy Bloat in Large Model Training
UMich
Perseus: use a graph cut-based algorithm to obtain the "iteration time-energy" Pareto frontier; schedule the energy consumption across time.
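A simplified illustration of frontier selection over made-up plans (Perseus's graph-cut formulation is not reproduced here):

```python
# Illustrative only: pick among invented per-iteration execution plans by their
# (time, energy) cost; Perseus derives real plans with a graph cut-based algorithm.
plans = [
    {"name": "all-max-freq",  "time": 1.00, "energy": 520.0},
    {"name": "slow-bubbles",  "time": 1.00, "energy": 470.0},  # same speed, less energy
    {"name": "mild-slowdown", "time": 1.08, "energy": 430.0},
    {"name": "aggressive",    "time": 1.30, "energy": 440.0},  # dominated: slower and costlier
]

def pareto_frontier(plans):
    """Keep only plans that no other plan beats on both iteration time and energy."""
    keep = []
    for p in plans:
        dominated = any(
            q["time"] <= p["time"] and q["energy"] <= p["energy"]
            and (q["time"] < p["time"] or q["energy"] < p["energy"])
            for q in plans
        )
        if not dominated:
            keep.append(p)
    return sorted(keep, key=lambda pl: pl["time"])

def cheapest_within_deadline(frontier, deadline):
    feasible = [p for p in frontier if p["time"] <= deadline]
    return min(feasible, key=lambda pl: pl["energy"]) if feasible else None

front = pareto_frontier(plans)
print([p["name"] for p in front])                       # the time-energy frontier
print(cheapest_within_deadline(front, deadline=1.10))   # lowest-energy plan meeting the deadline
```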
LLM Inference
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism
PKU
ESP: Elastic Sequence Parallelism
Elastically adjust the degree of parallelism in real-time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances.
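A toy degree-of-parallelism policy (the thresholds and helper are invented; the real ESP scheduler is considerably more involved):

```python
# Toy elastic sequence-parallelism policy; sizing rule and API are invented.
def pick_sp_degree(prompt_tokens, free_instances, tokens_per_instance=32_000):
    """Use enough instances to hold the prompt's KV cache, capped by what is free.
    A real system would also shrink the degree back down for the decoding phase."""
    needed = max(1, -(-prompt_tokens // tokens_per_instance))   # ceiling division
    return min(needed, max(1, free_instances))

# Long prompts get a wide prefill; short ones stay on a single instance.
print(pick_sp_degree(prompt_tokens=250_000, free_instances=8))  # -> 8
print(pick_sp_degree(prompt_tokens=4_000, free_instances=8))    # -> 1
```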
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
SJTU IPADS
GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.
Integrate adaptive predictors and neuron-aware sparse operators.
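A simplified sketch of the hot/cold placement step, with made-up profiling counts (not PowerInfer's engine):

```python
# Illustrative hot/cold neuron split in the spirit of PowerInfer (not its real code).
def place_neurons(activation_counts, bytes_per_neuron, gpu_budget_bytes):
    """Return (gpu_neurons, cpu_neurons) given per-neuron activation frequencies."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    gpu, cpu, used = [], [], 0
    for n in ranked:
        if used + bytes_per_neuron <= gpu_budget_bytes:
            gpu.append(n)                 # hot neuron: preloaded onto the GPU
            used += bytes_per_neuron
        else:
            cpu.append(n)                 # cold neuron: computed on the CPU on demand
    return gpu, cpu

counts = {"n0": 9800, "n1": 120, "n2": 8700, "n3": 45, "n4": 7600}
print(place_neurons(counts, bytes_per_neuron=4096, gpu_budget_bytes=3 * 4096))
```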
Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation
GaTech & Princeton & Stanford
Efficient early exits: trade off accuracy and resource costs at per-input granularity using early-exit models.
Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.
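A toy simulation of the constant-batch idea (illustrative; the exit probabilities and queue handling are invented):

```python
# Toy simulation: exited inputs are replaced from a queue so every block stays full.
import random

def run_with_constant_batch(request_queue, batch_size=8, exit_prob=0.4, num_blocks=4):
    """After each block, exited inputs are replaced from the queue, so every block
    (which could itself be replicated across GPUs) always processes a full batch."""
    random.seed(0)
    batch = [request_queue.pop(0) for _ in range(min(batch_size, len(request_queue)))]
    per_block_occupancy = []
    for _ in range(num_blocks):
        per_block_occupancy.append(len(batch))
        batch = [r for r in batch if random.random() > exit_prob]  # some inputs exit early
        while len(batch) < batch_size and request_queue:            # refill to keep the batch full
            batch.append(request_queue.pop(0))
    return per_block_occupancy

print(run_with_constant_batch(list(range(100))))   # occupancy stays at the batch size
```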
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
Princeton & GaTech
Automatically apply and manage early exits (certain inputs can exit with results at intermediate layers) in ML models.
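A minimal early-exit forward pass with per-ramp confidence thresholds (illustrative; Apparate additionally tunes ramp placement and thresholds online):

```python
# Minimal early-exit inference loop; model, ramps, and thresholds are made up.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_exits(x, layers, ramps, thresholds):
    """layers: list of callables; ramps: dict layer_idx -> classifier head.
    Give the final layer's ramp a threshold of 0 so every input eventually exits."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in ramps:
            probs = ramps[i](x)
            if probs.max() >= thresholds[i]:      # confident enough: exit at this ramp
                return int(probs.argmax()), i
    raise RuntimeError("no exit fired; set the final ramp's threshold to 0")

# Tiny fake model: each "layer" is a matrix multiply, each ramp a softmax head.
rng = np.random.default_rng(0)
layers = [(lambda x, W=rng.normal(size=(16, 16)): np.tanh(x @ W)) for _ in range(4)]
heads = {i: (lambda x, W=rng.normal(size=(16, 3)): softmax(x @ W)) for i in (1, 3)}
print(forward_with_exits(rng.normal(size=16), layers, heads, {1: 0.6, 3: 0.0}))
```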
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor
UIUC & MSRA
T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips (e.g., the Graphcore IPU).
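A numpy toy of the underlying compute-shift pattern, where operand shards rotate between cores instead of round-tripping through off-chip memory (not T10's rTensor abstraction or its compiler):

```python
import numpy as np

def ring_matmul(A, B, num_cores):
    """Simulate C = A @ B across cores: each core keeps a row block of A resident
    and blocks of B rotate around a ring rather than going through off-chip memory."""
    m, k = A.shape
    a_rows = np.array_split(np.arange(m), num_cores)     # rows of A owned by each core
    b_blocks = np.array_split(np.arange(k), num_cores)   # the B block each core starts with
    held = list(range(num_cores))                        # which B block each core currently holds
    C = np.zeros((m, B.shape[1]))
    for _ in range(num_cores):                           # num_cores compute-shift steps
        for core in range(num_cores):
            blk = b_blocks[held[core]]
            C[a_rows[core]] += A[np.ix_(a_rows[core], blk)] @ B[blk]
        held = held[1:] + held[:1]                       # shift: pass B blocks to the next core
    return C

A, B = np.random.rand(8, 6), np.random.rand(6, 4)
print(np.allclose(ring_matmul(A, B, num_cores=2), A @ B))  # -> True
```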
SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference
Indian Institute of Science & MSR
Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.
Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing, retargetable compiler that can generate code for any specified schedule.
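A toy of the schedule idea (the schedule fields and interpreter are invented, not SilvanForge's scheduling language):

```python
# Toy schedule-driven evaluation of a decision-tree ensemble (illustrative only).
def eval_tree(tree, x):
    """Tree nodes are (feature, threshold, left, right) or ('leaf', value)."""
    node = tree
    while node[0] != "leaf":
        feat, thr, left, right = node
        node = left if x[feat] <= thr else right
    return node[1]

def run(ensemble, batch, schedule):
    """schedule = {'loop_order': 'trees_outer' | 'rows_outer', 'row_tile': int}"""
    scores = [0.0] * len(batch)
    tile = schedule["row_tile"]
    row_tiles = [range(i, min(i + tile, len(batch))) for i in range(0, len(batch), tile)]
    if schedule["loop_order"] == "trees_outer":          # reuse one tree across a row tile
        for tree in ensemble:
            for rows in row_tiles:
                for r in rows:
                    scores[r] += eval_tree(tree, batch[r])
    else:                                                # rows_outer: finish each row tile first
        for rows in row_tiles:
            for tree in ensemble:
                for r in rows:
                    scores[r] += eval_tree(tree, batch[r])
    return scores

leaf = lambda v: ("leaf", v)
ensemble = [(0, 0.5, leaf(1.0), leaf(-1.0)), (1, 0.2, leaf(0.5), leaf(0.0))]
print(run(ensemble, [[0.1, 0.9], [0.8, 0.1]], {"loop_order": "trees_outer", "row_tile": 32}))
```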
Dirigent: Lightweight Serverless Orchestration
ETH
Simplify state management of the existing orchestration system (Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.
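A very rough sketch of the state-management contrast (invented API; Dirigent's actual design is far more complete): scheduling state stays in the control plane's memory and is rebuilt on restart, instead of being persisted on every sandbox update as a Kubernetes-style controller would.

```python
# Illustrative in-memory control-plane state (not Dirigent's code or data model).
class InMemoryControlPlane:
    def __init__(self):
        self.sandboxes = {}                      # sandbox_id -> node; never persisted per-update

    def place(self, sandbox_id, node):
        self.sandboxes[sandbox_id] = node        # one dict write, no durable round-trip

    def rebuild(self, node_reports):
        """After a control-plane restart, reconstruct state from what worker nodes report."""
        self.sandboxes = {sid: node for node, sids in node_reports.items() for sid in sids}

cp = InMemoryControlPlane()
cp.place("fn-a/1", "node-3")
cp.rebuild({"node-3": ["fn-a/1"], "node-7": ["fn-b/4"]})
print(cp.sandboxes)
```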