OSDI 2024
Meta Info
Homepage: https://www.usenix.org/conference/osdi24
Paper list: https://www.usenix.org/conference/osdi24/technical-sessions
Papers
Serving Large Language Models (LLMs)
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [Paper] [Code]
MSR India & GaTech
Sarathi-Serve
Chunked-prefills: split a prefill request into near equal-sized chunks; create stall-free schedules that add new requests in a batch without pausing ongoing decodes.
Stall-free scheduling: improve throughput with large batch sizes; minimize the effect of batching on latency.
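To make the scheduling idea concrete, here is a minimal Python sketch (not the authors' implementation; the per-iteration token budget and request fields are assumptions) of building one stall-free batch: ongoing decodes are admitted first, and the leftover token budget is filled with prefill chunks so decodes are never paused.

```python
# Minimal sketch of chunked-prefill, stall-free batching under a token budget.
from dataclasses import dataclass
from collections import deque

TOKEN_BUDGET = 512  # hypothetical per-iteration token budget

@dataclass
class Request:
    rid: int
    prompt_len: int
    prefilled: int = 0       # prompt tokens processed so far
    decoding: bool = False   # becomes True once prefill completes

def schedule_iteration(running: list, waiting: deque) -> list:
    """Build one stall-free batch as (request, num_tokens) pairs."""
    batch, budget = [], TOKEN_BUDGET
    # 1. Every ongoing decode contributes exactly one token and is never dropped.
    for req in running:
        if req.decoding:
            batch.append((req, 1))
            budget -= 1
    # 2. Spend the remaining budget on prefill chunks: partially prefilled
    #    requests first, then newly admitted ones.
    candidates = [r for r in running if not r.decoding] + list(waiting)
    for req in candidates:
        if budget <= 0:
            break
        chunk = min(budget, req.prompt_len - req.prefilled)
        if chunk > 0:
            batch.append((req, chunk))
            req.prefilled += chunk
            budget -= chunk
            if req.prefilled == req.prompt_len:
                req.decoding = True
            if req in waiting:
                waiting.remove(req)
                running.append(req)
    return batch

# Example: a long prompt is prefetched in chunks while a decode keeps running.
running = [Request(rid=0, prompt_len=8, prefilled=8, decoding=True)]
waiting = deque([Request(rid=1, prompt_len=2000)])
print([(r.rid, n) for r, n in schedule_iteration(running, waiting)])
```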
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [Paper] [Code]
Edinburgh
Multi-tier checkpoint loading.
Live migration of LLM inference: the source server migrates only the tokens; the KV cache is recomputed at the destination server.
Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
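A rough sketch of the cost-model idea, with made-up bandwidth and recomputation numbers (not ServerlessLLM's actual estimator): each candidate server's startup latency is the checkpoint load time from its fastest local tier, plus the cost of migrating away any ongoing inference.

```python
# Sketch: pick the server that minimizes estimated model startup latency.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    tier_bandwidth_gbps: dict           # tiers holding this checkpoint, e.g. {"dram": 20.0, "ssd": 3.0}
    running_inference_tokens: int = 0   # tokens of an ongoing inference that must be migrated away

def load_time_s(ckpt_gb, tiers):
    """Time to load the checkpoint from the fastest local tier that holds it."""
    if not tiers:
        return float("inf")             # no local copy; treat remote fetch as worst case
    return ckpt_gb / max(tiers.values())

def migration_time_s(tokens, recompute_tok_per_s=5000.0):
    """Live migration ships only the tokens; the KV cache is recomputed at the destination."""
    return tokens / recompute_tok_per_s

def pick_server(servers, ckpt_gb):
    def startup_cost(s):
        cost = load_time_s(ckpt_gb, s.tier_bandwidth_gbps)
        if s.running_inference_tokens:  # must first migrate the current inference elsewhere
            cost += migration_time_s(s.running_inference_tokens)
        return cost
    return min(servers, key=startup_cost)

servers = [
    Server("gpu-a", {"ssd": 3.0}),                                   # idle, but cold checkpoint
    Server("gpu-b", {"dram": 20.0}, running_inference_tokens=4000),  # warm, but currently busy
]
print(pick_server(servers, ckpt_gb=26.0).name)                       # -> gpu-b
```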
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Paper]
Seoul National University
InfiniGen: a KV cache management framework for long-text generation.
Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
Prefetch only the essential KV cache entries instead of fetching them all, mitigating the overhead of fetching from host memory.
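A minimal sketch of the speculation step, with assumed shapes and random data (not InfiniGen's code): approximate the next layer's attention scores using a slice of the query weight and a partial key cache, then prefetch only the top-k tokens' KV entries from host memory.

```python
# Sketch: speculate which KV entries matter before running the next layer.
import numpy as np

def speculate_important_tokens(hidden, partial_wq, partial_keys, k=64):
    """Return indices of the k tokens with the highest approximate attention score."""
    q_approx = hidden @ partial_wq        # rehearsal with only a slice of the query weight
    scores = partial_keys @ q_approx      # approximate attention logits, one per cached token
    return np.argsort(scores)[-k:]

rng = np.random.default_rng(0)
seq_len, d_model, d_partial = 4096, 4096, 128      # hypothetical sizes
hidden = rng.standard_normal(d_model)              # input of the current layer
partial_wq = rng.standard_normal((d_model, d_partial))
partial_keys = rng.standard_normal((seq_len, d_partial))   # partial key cache kept on the GPU

prefetch_idx = speculate_important_tokens(hidden, partial_wq, partial_keys)
# Only these entries of the full, host-resident KV cache are fetched for the next layer.
print(len(prefetch_idx), "of", seq_len, "KV entries prefetched")
```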
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Paper] [Code]
PKU & UCSD
Disaggregate the prefill and decoding computation.
Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
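A toy illustration (hypothetical per-GPU throughput numbers, not DistServe's algorithm) of why the two phases need separate, co-optimized resource allocation: enumerate GPU splits between the phases and keep the one whose end-to-end goodput, bounded by the slower phase, is highest.

```python
# Toy search over prefill/decode GPU splits to maximize SLO-attaining goodput.
PREFILL_RPS_PER_GPU = 4.0   # hypothetical requests/s one prefill instance sustains within the TTFT SLO
DECODE_RPS_PER_GPU = 6.0    # hypothetical requests/s one decode instance sustains within the TPOT SLO

def best_split(num_gpus: int):
    best = None
    for prefill_gpus in range(1, num_gpus):
        decode_gpus = num_gpus - prefill_gpus
        goodput = min(prefill_gpus * PREFILL_RPS_PER_GPU,
                      decode_gpus * DECODE_RPS_PER_GPU)
        if best is None or goodput > best[0]:
            best = (goodput, prefill_gpus, decode_gpus)
    return best

goodput, p, d = best_split(8)
print(f"{p} prefill + {d} decode GPUs -> {goodput:.1f} req/s goodput")
```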
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving [Paper]
PKU & Shanghai AI Lab
A credit-based batching algorithm to decide when to merge and unmerge LoRA adapters with the base model.
A request-adapter co-migration algorithm to decide when to migrate requests and adapters across different worker replicas.
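A rough sketch of the merge/unmerge decision with hypothetical thresholds (dLoRA's actual credit-based algorithm is more involved): merged mode is fastest when one adapter dominates the queue, unmerged mode lets requests for many adapters share a batch, and a credit counter damps oscillation between the two modes.

```python
# Sketch of a credit-damped merge/unmerge controller for LoRA serving.
from collections import Counter

MERGE_THRESHOLD = 0.8    # hypothetical: dominant-adapter share needed to prefer merged mode
CREDIT_TO_SWITCH = 3     # hypothetical: consecutive iterations of evidence before switching

class ModeController:
    def __init__(self):
        self.merged_adapter = None   # adapter currently merged into the base model, if any
        self.credit = 0

    def step(self, pending_adapters: list) -> str:
        counts = Counter(pending_adapters)
        adapter, top = counts.most_common(1)[0]
        dominant = top / len(pending_adapters) >= MERGE_THRESHOLD
        want_merge = adapter if dominant else None
        if want_merge == self.merged_adapter:
            self.credit = 0                      # current mode already matches the workload
        else:
            self.credit += 1                     # accumulate credit before switching
            if self.credit >= CREDIT_TO_SWITCH:  # avoid thrashing on short bursts
                self.merged_adapter, self.credit = want_merge, 0
        return f"merged({self.merged_adapter})" if self.merged_adapter else "unmerged"

ctrl = ModeController()
for batch in [["a"] * 9 + ["b"],            # adapter "a" dominates
              ["a"] * 9 + ["b"],
              ["a"] * 9 + ["b"],
              ["a", "b", "c", "d", "e"]]:    # mixed adapters -> prefer unmerged
    print(ctrl.step(batch))
```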
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [Paper] [Code]
SJTU & MSRA
Semantic Variable: a unified abstraction to expose application-level knowledge to public LLM services.
Annotate an input/output variable in the prompt of a request.
Create a data pipeline by connecting multiple LLM requests.
Allow conventional data-flow analysis to uncover correlations across multiple LLM requests.
Implemented in Python.
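An illustrative pseudo-API (not Parrot's real Python frontend) showing the Semantic Variable idea: named placeholders connect the output of one LLM request to the prompts of later requests, so the serving system sees the whole dataflow instead of isolated prompts.

```python
# Sketch: semantic variables expose the dataflow between LLM requests.
class SemanticVariable:
    """A named placeholder connecting the output of one request to later prompts."""
    def __init__(self, name):
        self.name, self.value = name, None

class LLMRequest:
    def __init__(self, template, inputs, output):
        self.template, self.inputs, self.output = template, inputs, output

    def unresolved_inputs(self):
        return [v for v in self.inputs if v.value is None]

# Two chained requests of a hypothetical "summarize then translate" application.
article = SemanticVariable("article")
article.value = "LLM serving is hard ..."
summary = SemanticVariable("summary")
translation = SemanticVariable("translation")

requests = [
    LLMRequest("Summarize: {article}", [article], summary),
    LLMRequest("Translate to French: {summary}", [summary], translation),
]

# The dependency (summary feeds request 2) is visible before execution, so the
# serving system can pipeline the two requests and co-locate them.
for r in requests:
    print(r.template, "waits on", [v.name for v in r.unresolved_inputs()])
```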
Fairness in Serving Large Language Models [Paper] [Code]
UC Berkeley
This is the first work to discuss the fair serving of LLMs.
Propose a fair-serving algorithm called Virtual Token Counter (VTC).
Track the service received by each client.
Prioritize the clients that have received the least service.
Only manipulate the dispatch order; do not reject a request if it fits in the batch.
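A compact sketch of the VTC idea (simplified relative to the paper's full algorithm): dispatch the waiting request whose client has received the least service, and only reorder requests, never reject them when they fit.

```python
# Sketch of a Virtual Token Counter style dispatcher.
import heapq
from collections import defaultdict

class VTCScheduler:
    def __init__(self):
        self.counters = defaultdict(float)   # tokens served per client so far
        self.queue = []                      # heap of (client counter at enqueue, seq, client, request)
        self._seq = 0

    def submit(self, client, request):
        heapq.heappush(self.queue, (self.counters[client], self._seq, client, request))
        self._seq += 1

    def dispatch(self, batch_has_room):
        """Pop the request of the least-served client, but only if the batch can fit it."""
        if self.queue and batch_has_room:
            _, _, client, request = heapq.heappop(self.queue)
            return client, request
        return None

    def account(self, client, tokens_served):
        """Charge prompt + generated tokens to the client after serving."""
        self.counters[client] += tokens_served

sched = VTCScheduler()
sched.account("heavy_user", 10_000)          # has already received a lot of service
sched.submit("heavy_user", "req-A")
sched.submit("light_user", "req-B")
print(sched.dispatch(batch_has_room=True))   # -> ('light_user', 'req-B')
```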
Resource Allocation
Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences [Paper]
Meta Platforms
Main challenges for a resource-allocation framework.
Usability: how to translate real-life policies into precise mathematical formulas.
Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers.
Rebalancer: Meta's resource-allocation framework.
An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
A high-level specification language to lower the barrier for adoption by system practitioners (for usability).
Job Scheduling
When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling [Paper] [Code]
Tufts
PCS: Predictability-Centric Scheduling
Use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights).
Use a simulation-aided search strategy to discover WFQ configurations.
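A simplified sketch of simulation-aided search using a toy fluid WFQ simulator (the job trace, weights, and error metric are assumptions, not the PCS artifact): simulate a representative trace under each candidate weight configuration and keep the one with the lowest completion-time estimation error, breaking ties by average JCT.

```python
# Toy simulation-aided search over WFQ weight configurations.
import itertools

def simulate(weights, jobs, dt=0.01):
    """Toy fluid WFQ: classes with pending work share unit capacity by weight.
    Returns (avg JCT, avg |promised - actual| estimation error); all jobs arrive at t=0."""
    total_w = sum(weights.values())
    promised = {}
    for c, sizes in jobs.items():
        ahead = 0.0
        for i, s in enumerate(sizes):
            ahead += s
            promised[(c, i)] = ahead / (weights[c] / total_w)  # assumes the class always gets its full share
    pending = {c: list(sizes) for c, sizes in jobs.items()}
    finished, t = {}, 0.0
    while any(pending.values()):
        active = [c for c in pending if pending[c]]
        w_active = sum(weights[c] for c in active)
        for c in active:
            pending[c][0] -= (weights[c] / w_active) * dt      # this class's share of capacity
            if pending[c][0] <= 0:
                finished[(c, len(jobs[c]) - len(pending[c]))] = t + dt
                pending[c].pop(0)
        t += dt
    jct = sum(finished.values()) / len(finished)
    err = sum(abs(promised[k] - finished[k]) for k in finished) / len(finished)
    return jct, err

trace = {"A": [1.0, 1.0], "B": [4.0]}                           # hypothetical job sizes per class
candidates = [{"A": a, "B": b} for a, b in itertools.product([1, 2, 4], repeat=2)]
best = min(candidates, key=lambda w: simulate(w, trace)[::-1])  # minimize error first, then JCT
print("chosen weights:", best, "->", simulate(best, trace))
```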
MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale [Paper]
Meta Platforms
MAST: ML Application Scheduler on Twine
Provide a global-scheduling abstraction to all ML training workloads.
Three design principles: temporal decoupling, scope decoupling, and exhaustive search.
Auto Parallelization
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training [Paper] [Code]
USTC & MSRA & xAI & BaseBit Technologies
Empower domain experts to construct their own search space through three primitives: op-trans, op-assign, and op-order.
Allow the application of constraints to these primitives during space construction.
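An illustrative pseudo-API (not nnScaler's actual interface) showing how the three primitives and a user-supplied constraint compose while the search space is being constructed.

```python
# Sketch: op-trans partitions an operator, op-assign maps sub-operators to devices,
# op-order sequences them; a constraint prunes assignments during construction.
from itertools import permutations

def op_trans(op, algo, num):
    """Partition `op` into `num` sub-operators using a named partition algorithm."""
    return [f"{op}.{algo}[{i}/{num}]" for i in range(num)]

def op_assign(sub_ops, devices, constraint=None):
    """Yield sub-operator -> device assignments that satisfy the given constraint."""
    for perm in permutations(devices, len(sub_ops)):
        mapping = dict(zip(sub_ops, perm))
        if constraint is None or constraint(mapping):
            yield mapping

def op_order(assignment):
    """Fix an execution order over the assigned sub-operators (here: a simple sort)."""
    return sorted(assignment.items())

# Constraint example: restrict the two shards to a fast interconnect pair (gpu0, gpu1).
subs = op_trans("matmul", algo="row_split", num=2)
plans = [op_order(a)
         for a in op_assign(subs, devices=["gpu0", "gpu1", "gpu2", "gpu3"],
                            constraint=lambda m: set(m.values()) <= {"gpu0", "gpu1"})]
print(plans)
```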
Machine Learning Inference
Usher: Holistic Interference Avoidance for Resource Optimized ML Inference [Paper] [Code]
UVA & GaTech
Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).
GPU kernel-based model resource requirement estimator.
Heuristic-based interference-aware resource utilization-maximizing scheduler that decides the batch size, model replication degree, and model placement.
Operator graph merger to merge multiple models to minimize interference in GPU cache.
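A condensed sketch of the placement heuristic with made-up demand estimates (not Usher's system): place each model replica on the GPU whose predicted utilization stays within capacity and whose contention on the shared bottleneck is lowest.

```python
# Sketch of interference-aware placement for spatially multiplexed GPUs.
from dataclasses import dataclass

@dataclass
class ModelDemand:
    name: str
    compute: float   # estimated fraction of GPU compute units (from kernel-level analysis)
    membw: float     # estimated fraction of GPU memory bandwidth

@dataclass
class GPU:
    name: str
    compute_used: float = 0.0
    membw_used: float = 0.0

    def contention_if_added(self, m):
        c, b = self.compute_used + m.compute, self.membw_used + m.membw
        if c > 1.0 or b > 1.0:
            return None                      # placement would oversubscribe this GPU
        return max(c, b)                     # proxy for contention on the shared bottleneck

def place(models, gpus):
    plan = {}
    for m in sorted(models, key=lambda m: -(m.compute + m.membw)):  # largest demands first
        scored = [(g.contention_if_added(m), g) for g in gpus]
        scored = [(s, g) for s, g in scored if s is not None]
        _, best = min(scored, key=lambda x: x[0])
        best.compute_used += m.compute
        best.membw_used += m.membw
        plan[m.name] = best.name
    return plan

models = [ModelDemand("resnet", 0.5, 0.3), ModelDemand("bert", 0.4, 0.5),
          ModelDemand("gpt-small", 0.3, 0.6)]
print(place(models, [GPU("gpu0"), GPU("gpu1")]))
```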
Tensor Program Generation
Machine Learning APIs
In-Network Machine Learning
Microkernel
Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel [Paper]
Huawei Central Software Institute & SJTU
HongMeng kernel (HM)
Compute Express Link (CXL)
Managing Memory Tiers with CXL in Virtualized Environments [Paper]
Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich
Distributed Snapshots
Network Interface Card (NIC)
Collective Communication Library
ACCL+: an FPGA-Based Collective Engine for Distributed Applications [Paper]
ETH & Amsterdam & AMD
Hardware Accelerators
Cloud Block Storage
Burstable Cloud Block Storage with Data Processing Units [Paper]
PKU & Alibaba Cloud
Formal Verification
References
Notes from SJTU IPADS (in Chinese)