OSDI 2024
Homepage:
Paper list:
Acceptance rate: 19.2% (= 53 / 276)
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [] []
MSR India & GaTech
Sarathi-Serve
Chunked prefills: split a prefill request into near-equal-sized chunks so that prefill work can be interleaved with ongoing decodes.
Stall-free scheduling: add new requests to a running batch without pausing ongoing decodes, improving throughput with large batch sizes while minimizing the latency impact of batching (see the sketch below).
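A minimal sketch of the two ideas above, assuming a fixed per-iteration token budget; all names and structures are illustrative, not Sarathi-Serve's actual code.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_len: int            # total prompt tokens that need prefill
    prefilled: int = 0         # prompt tokens already prefilled
    decoding: bool = False     # True once prefill has completed

    @property
    def remaining_prefill(self) -> int:
        return self.prompt_len - self.prefilled

def build_batch(requests, token_budget=512):
    """Compose one iteration: admit all ongoing decodes first (1 token each),
    then fill the leftover token budget with chunks of pending prefills,
    so prefills never stall decodes."""
    batch, budget = [], token_budget
    for r in requests:                              # 1) decodes are never paused
        if r.decoding and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    for r in requests:                              # 2) chunked prefill fills the rest
        if not r.decoding and r.remaining_prefill > 0 and budget > 0:
            chunk = min(r.remaining_prefill, budget)
            batch.append((r.rid, "prefill", chunk))
            r.prefilled += chunk
            budget -= chunk
            if r.remaining_prefill == 0:
                r.decoding = True                   # joins the decode set next iteration
    return batch

if __name__ == "__main__":
    reqs = [Request("A", prompt_len=1000),
            Request("B", prompt_len=300, prefilled=300, decoding=True)]
    for step in range(3):
        print(step, build_batch(reqs, token_budget=512))
```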
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [] []
Edinburgh
Multi-tier checkpoint loading.
Live migration of LLM inference: the source server sends only the tokens; the destination server recomputes the KV cache from them.
Use cost models to estimate the time to load a checkpoint from each tier of the storage hierarchy and the time to migrate an ongoing inference to another server; choose the server that minimizes model startup latency (see the sketch below).
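A back-of-the-envelope version of that startup-latency estimate; the server fields, bandwidth numbers, and recomputation rate below are invented for illustration, not ServerlessLLM's actual cost model.

```python
def load_time(ckpt_size_gb, tier_bandwidth_gbps):
    """Estimated time to load a checkpoint from a given storage tier."""
    return ckpt_size_gb / tier_bandwidth_gbps

def migration_time(tokens, recompute_tokens_per_s):
    """Estimated time to live-migrate an inference: only the tokens are sent,
    and the destination recomputes the KV cache from them."""
    return tokens / recompute_tokens_per_s

def best_server(servers, ckpt_size_gb):
    """Pick the server that minimizes model startup latency. Each server reports
    the fastest storage tier holding the checkpoint (None if absent) and the
    inference it would have to migrate away to make room."""
    best, best_latency = None, float("inf")
    for s in servers:
        tier_bw = s.get("fastest_tier_gbps")       # e.g., DRAM > NVMe > remote store
        if tier_bw is None:
            continue
        latency = load_time(ckpt_size_gb, tier_bw) + migration_time(
            s.get("running_tokens", 0), s.get("recompute_tokens_per_s", 5000))
        if latency < best_latency:
            best, best_latency = s["name"], latency
    return best, best_latency

if __name__ == "__main__":
    servers = [
        {"name": "gpu-1", "fastest_tier_gbps": 20.0, "running_tokens": 4096,
         "recompute_tokens_per_s": 8000},
        {"name": "gpu-2", "fastest_tier_gbps": 3.0, "running_tokens": 0},
    ]
    print(best_server(servers, ckpt_size_gb=14.0))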
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management []
Seoul National University
InfiniGen: a KV cache management framework for long-text generation.
Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
Prefetch only the essential KV cache entries instead of fetching them all, which mitigates the overhead of fetching the KV cache from host memory (see the sketch below).
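A NumPy sketch of that key insight: estimate the next layer's attention with the current layer's input and only a slice of the next layer's query weight and key cache, then prefetch just the top-scoring tokens. Keeping the first `keep_dims` columns is a simplification; InfiniGen derives better partial weights offline.

```python
import numpy as np

def speculate_important_tokens(x_cur, w_q_next, k_cache_next, keep_dims=16, topk=32):
    """Minimal rehearsal: approximate next-layer attention scores, then return
    the indices of the KV entries worth prefetching from host memory."""
    d = min(keep_dims, w_q_next.shape[1])
    q_approx = x_cur @ w_q_next[:, :d]             # (1, d) partial query
    scores = q_approx @ k_cache_next[:, :d].T      # (1, n_tokens) approximate logits
    top = np.argsort(-scores[0])[:topk]
    return np.sort(top)                            # prefetch only these KV entries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden, n_tokens = 128, 1024
    x_cur = rng.standard_normal((1, hidden))
    w_q_next = rng.standard_normal((hidden, hidden))
    k_cache_next = rng.standard_normal((n_tokens, hidden))
    idx = speculate_important_tokens(x_cur, w_q_next, k_cache_next)
    print(f"prefetch {len(idx)} / {n_tokens} KV entries:", idx[:8], "...")
```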
Llumnix: Dynamic Scheduling for Large Language Model Serving [] []
Alibaba
Reschedule requests across model instances to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
Live migration of requests and their in-memory state (tokens); a toy rescheduling policy is sketched below.
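A toy policy in the spirit of those goals: move a cheap, low-priority request off the most loaded instance when another instance has much more free KV-cache space. Llumnix's actual policies and state transfer are richer; every field name here is illustrative.

```python
def pick_migration(instances, high_priority_ids=frozenset()):
    """Pick one (request, source, destination) migration, or None if the free-space
    gap between instances is too small to be worth the move."""
    src = min(instances, key=lambda i: i["free_blocks"])
    dst = max(instances, key=lambda i: i["free_blocks"])
    if dst["free_blocks"] - src["free_blocks"] < 2 * src.get("block_size", 16):
        return None
    movable = [r for r in src["requests"] if r["id"] not in high_priority_ids]
    if not movable:
        return None
    victim = min(movable, key=lambda r: r["kv_blocks"])   # cheapest state to move
    return {"request": victim["id"], "from": src["name"], "to": dst["name"]}

if __name__ == "__main__":
    instances = [
        {"name": "i0", "free_blocks": 4,
         "requests": [{"id": "r1", "kv_blocks": 40}, {"id": "r2", "kv_blocks": 8}]},
        {"name": "i1", "free_blocks": 900, "requests": []},
    ]
    print(pick_migration(instances, high_priority_ids={"r1"}))
```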
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [] []
PKU & UCSD
Disaggregate the prefill and decoding computation.
Co-optimize the resource allocation and parallelism strategy for each phase, taking the cluster's bandwidth into account to minimize communication overhead (a toy configuration search is sketched below).
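A toy enumeration of disaggregated plans under a GPU budget. The analytic latency models and the goodput definition (SLO-attaining requests per second per GPU) are stand-ins for DistServe's profiled models, not its actual search.

```python
from itertools import product

# Placeholder analytic latency models; DistServe profiles these on real hardware.
def prefill_latency(tp):
    return 0.8 / tp + 0.05 * (tp - 1)      # seconds per prefill (compute vs. comm)

def decode_latency(tp):
    return 0.05 / tp + 0.01 * (tp - 1)     # seconds per output token

def goodput(n_prefill, n_decode, tp_p, tp_d, ttft_slo=1.0, tpot_slo=0.1, out_tokens=100):
    """Crude per-GPU goodput: the request rate both phases can sustain while
    staying within their latency SLOs, divided by the GPUs the plan uses."""
    if prefill_latency(tp_p) > ttft_slo or decode_latency(tp_d) > tpot_slo:
        return 0.0
    prefill_rate = n_prefill / prefill_latency(tp_p)                 # requests/s
    decode_rate = n_decode / (decode_latency(tp_d) * out_tokens)     # requests/s
    return min(prefill_rate, decode_rate) / (n_prefill * tp_p + n_decode * tp_d)

def best_plan(gpu_budget=8):
    """Enumerate (prefill instances, decode instances, per-phase tensor parallelism)
    and keep the plan with the highest per-GPU goodput."""
    candidates = [((n_p, tp_p, n_d, tp_d), goodput(n_p, n_d, tp_p, tp_d))
                  for n_p, n_d, tp_p, tp_d in product(range(1, gpu_budget + 1),
                                                      range(1, gpu_budget + 1),
                                                      (1, 2, 4), (1, 2, 4))
                  if n_p * tp_p + n_d * tp_d <= gpu_budget]
    return max(candidates, key=lambda c: c[1])

if __name__ == "__main__":
    print(best_plan())
```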
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving []
PKU & Shanghai AI Lab
A credit-based batching algorithm to decide when to merge and unmerge LoRA adapters with the base model.
A request-adapter co-migration algorithm to decide when to migrate requests and adapters between different worker replicas.
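A simplified interpretation of the credit-based merge/unmerge decision above: skewed queues earn credit toward merging the dominant adapter into the base weights, diverse queues earn credit toward unmerged multi-adapter batching. The credit accounting below is an assumption for illustration, not dLoRA's exact algorithm.

```python
from collections import Counter

class CreditBatcher:
    """Decide between merged execution (one adapter folded into the base weights)
    and unmerged execution (many adapters batched with separate LoRA compute)."""

    def __init__(self, threshold=0.4, decay=0.5):
        self.credit = 0.0              # >0 favors merging, <0 favors unmerging
        self.threshold = threshold
        self.decay = decay
        self.merged_adapter = None

    def step(self, pending_adapters):
        """pending_adapters: the adapter id of each queued request."""
        counts = Counter(pending_adapters)
        if not counts:
            return self.merged_adapter
        top_adapter, top_count = counts.most_common(1)[0]
        skew = top_count / len(pending_adapters)    # 1.0 => one adapter dominates
        self.credit = self.decay * self.credit + (skew - 0.5)
        if self.merged_adapter is None and self.credit > self.threshold:
            self.merged_adapter, self.credit = top_adapter, 0.0        # merge
        elif self.merged_adapter is not None and self.credit < -self.threshold:
            self.merged_adapter, self.credit = None, 0.0               # unmerge
        return self.merged_adapter

if __name__ == "__main__":
    batcher = CreditBatcher()
    for t in range(12):
        mix = ["A"] * 9 + ["B"] if t < 8 else ["A", "B", "C", "D"]
        print(t, batcher.step(mix))
```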
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [] []
SJTU & MSRA
Semantic Variable: a unified abstraction to expose application-level knowledge to public LLM services.
Annotate an input/output variable in the prompt of a request.
Create the data pipeline that connects multiple LLM requests.
Allow conventional data-flow analysis to uncover correlations across multiple LLM requests (a toy version is sketched below).
Implemented in Python.
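A toy re-implementation of the Semantic Variable idea, not Parrot's actual Python API: variables annotate the inputs and outputs of requests, and a simple data-flow pass recovers the inter-request dependencies the serving system can exploit.

```python
class SemanticVariable:
    """A placeholder in a prompt whose value is produced by one request and
    consumed by another, exposing the dependency to the serving layer."""
    def __init__(self, name):
        self.name = name
        self.producer = None
        self.consumers = []

class LLMRequest:
    def __init__(self, template, inputs=(), output=None):
        self.template, self.inputs, self.output = template, list(inputs), output
        for v in self.inputs:
            v.consumers.append(self)
        if output is not None:
            output.producer = self

    def __repr__(self):
        return f"Request({self.template!r})"

def dataflow_edges(requests):
    """Conventional data-flow analysis over Semantic Variables: an edge exists
    when one request's output variable feeds another request's input."""
    edges = []
    for r in requests:
        if r.output is not None:
            for consumer in r.output.consumers:
                edges.append((r, consumer))
    return edges

if __name__ == "__main__":
    article = SemanticVariable("article")
    summary = SemanticVariable("summary")
    step1 = LLMRequest("Summarize: {article}", inputs=[article], output=summary)
    step2 = LLMRequest("Translate to French: {summary}", inputs=[summary])
    # The serving system can now see that step2 depends on step1 and pipeline them.
    print(dataflow_edges([step1, step2]))
```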
Fairness in Serving Large Language Models [] []
UC Berkeley
The first work to discuss fair serving of LLMs.
Propose a fair-serving algorithm called Virtual Token Counter (VTC).
Track the service received by each client.
Prioritize the clients that have received the least service.
Only manipulate the dispatch order; never reject a request if it can fit in the batch (see the sketch below).
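A sketch of the VTC idea under simplifying assumptions (no counter lift for newly active clients, fixed input/output token weights); it only reorders dispatch and never rejects work.

```python
class VirtualTokenCounter:
    """Track the weighted token service each client has received and always
    dispatch from the backlogged client with the least service so far."""

    def __init__(self, input_weight=1.0, output_weight=2.0):
        self.counters = {}                 # client -> virtual tokens served
        self.queues = {}                   # client -> FIFO of pending requests
        self.w_in, self.w_out = input_weight, output_weight

    def submit(self, client, request):
        self.counters.setdefault(client, 0.0)
        self.queues.setdefault(client, []).append(request)

    def dispatch(self):
        """Pick the backlogged client with the smallest counter."""
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].pop(0)

    def account(self, client, prompt_tokens, output_tokens):
        self.counters[client] += self.w_in * prompt_tokens + self.w_out * output_tokens

if __name__ == "__main__":
    vtc = VirtualTokenCounter()
    vtc.submit("alice", "req-a1"); vtc.submit("alice", "req-a2"); vtc.submit("bob", "req-b1")
    client, req = vtc.dispatch()            # alice served first (tie broken by order)
    vtc.account(client, 500, 100)
    print(client, req, "->", vtc.dispatch())  # bob next: he has received less service
```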
Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences []
Meta Platforms
Main challenges for a resource-allocation framework:
Usability: how to translate real-life policies into precise mathematical formulas.
Scalability: the underlying optimization problems are NP-hard and cannot be solved efficiently by commercial solvers.
Rebalancer: Meta's resource-allocation framework.
An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
A high-level specification language to lower the barrier for adoption by system practitioners (for usability).
When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling [] []
Tufts
PCS: Predictability-Centric Scheduling.
Use Weighted Fair Queueing (WFQ) and find a suitable configuration of its parameters (e.g., queue weights).
Use a simulation-aided search strategy to discover such WFQ configurations (see the sketch below).
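A fluid-level toy of that simulation-aided search: simulate WFQ under candidate weight settings and keep the configuration whose simulated completion times deviate least from the completion-time estimates promised to users. PCS's actual simulator, parameter space, and objectives are richer than this.

```python
from itertools import product

def simulate_wfq(jobs, weights, capacity=1.0, dt=1.0):
    """Fluid WFQ: each tick, share capacity among backlogged classes in proportion
    to their weights (split equally within a class); return per-job finish times."""
    remaining = {j["id"]: j["size"] for j in jobs}
    done, t = {}, 0.0
    while remaining:
        active = {j["class"] for j in jobs if j["id"] in remaining}
        total_w = sum(weights[c] for c in active)
        for j in jobs:
            if j["id"] in remaining:
                class_share = capacity * weights[j["class"]] / total_w
                n_class = sum(1 for k in jobs
                              if k["id"] in remaining and k["class"] == j["class"])
                remaining[j["id"]] -= dt * class_share / n_class
        t += dt
        for jid in [jid for jid, r in remaining.items() if r <= 0]:
            done[jid] = t
            del remaining[jid]
    return done

def search_weights(jobs, promised):
    """Exhaustive simulation-aided search for WFQ weights that keep actual
    completion times closest to the estimates given to users."""
    best = None
    for w_hi, w_lo in product(range(1, 5), repeat=2):
        finish = simulate_wfq(jobs, {"hi": w_hi, "lo": w_lo})
        err = sum(abs(finish[j] - promised[j]) for j in finish)
        if best is None or err < best[0]:
            best = (err, {"hi": w_hi, "lo": w_lo})
    return best

if __name__ == "__main__":
    jobs = [{"id": "j1", "class": "hi", "size": 10},
            {"id": "j2", "class": "lo", "size": 10}]
    print(search_weights(jobs, promised={"j1": 15, "j2": 30}))
```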
MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale []
Meta Platforms
MAST: ML Application Scheduler on Twine.
Provide a global-scheduling abstraction to all ML training workloads.
Three design principles: temporal decoupling, scope decoupling, and exhaustive search.
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training [] []
USTC & MSRA & xAI & BaseBit Technologies
Empower domain experts to construct their own parallelization search space through three primitives: op-trans, op-assign, and op-order.
Allow constraints to be applied to those primitives during space construction (see the sketch below).
Usher: Holistic Interference Avoidance for Resource Optimized ML Inference [] []
UVA & GaTech
Usher: an interference-aware ML serving system that maximizes resource utilization via GPU spatial multiplexing.
GPU kernel-based estimator of each model's resource requirements.
Heuristic-based, interference-aware, utilization-maximizing scheduler that decides the batch size, model replication degree, and model placement (see the sketch below).
Operator graph merger that merges multiple models to minimize interference in the GPU cache.
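A toy greedy placement loop illustrating the interference-aware idea: the per-resource demand numbers stand in for what Usher's kernel-level estimator would produce, and the real scheduler also chooses batch sizes and replication degrees.

```python
def estimate_interference(gpu_load, model_demand):
    """Toy interference score: how far the combined demand for each resource
    (SM compute, memory bandwidth, cache) would exceed the GPU's capacity of 1.0."""
    return sum(max(0.0, gpu_load[r] + model_demand[r] - 1.0) for r in gpu_load)

def place_models(models, n_gpus=2):
    """Greedy interference-aware placement: put each model on the GPU where the
    added interference is smallest, spatially multiplexing the GPUs."""
    gpus = [{"sm": 0.0, "mem_bw": 0.0, "cache": 0.0} for _ in range(n_gpus)]
    placement = {}
    for m in sorted(models, key=lambda m: -sum(m["demand"].values())):  # big first
        g = min(range(n_gpus), key=lambda i: estimate_interference(gpus[i], m["demand"]))
        for r, v in m["demand"].items():
            gpus[g][r] += v
        placement[m["name"]] = g
    return placement, gpus

if __name__ == "__main__":
    models = [
        {"name": "resnet50-bs8",  "demand": {"sm": 0.5, "mem_bw": 0.3, "cache": 0.4}},
        {"name": "bert-base-bs4", "demand": {"sm": 0.4, "mem_bw": 0.5, "cache": 0.3}},
        {"name": "vit-bs2",       "demand": {"sm": 0.3, "mem_bw": 0.2, "cache": 0.2}},
    ]
    print(place_models(models))
```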
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning [] []
USTC & Huawei & ByteDance & Hunan University
Tensor Language Model (TLM)
Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [] []
MSRA
MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures [] []
Sydney & Alibaba
The code is currently not available.
ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications [] []
UChicago & ECNU & MSR
Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents [] []
Stanford & Princeton & Sapienza University of Rome & UMich
Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel []
Huawei Central Software Institute & SJTU
HongMeng kernel (HM)
Managing Memory Tiers with CXL in Virtualized Environments []
Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich
Beaver: Practical Partial Snapshots for Distributed Cloud Services [] []
UPenn & SJTU & Princeton & Microsoft & UW
High-throughput and Flexible Host Networking for Accelerated Computing [] []
Stanford & Cornell & Enfabrica
ACCL+: an FPGA-Based Collective Engine for Distributed Applications []
ETH & Amsterdam & AMD
Performance Interfaces for Hardware Accelerators [] []
EPFL
LPN: Latency Petri Net
Burstable Cloud Block Storage with Data Processing Units []
PKU & Alibaba Cloud
Anvil: Verifying Liveness of Cluster Management Controllers [] []
UIUC & UW-Madison & VMware Research & Feldera
Best Paper Award
Notes from SJTU IPADS (in Chinese)