# OSDI 2024

## Meta Info

Homepage: <https://www.usenix.org/conference/osdi24>

Paper list: <https://www.usenix.org/conference/osdi24/technical-sessions>

### Acceptance Rate

19.2% (= 53 / 276)

## Papers

### Large Language Models (LLMs)

* Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)]
  * MSR India & GaTech
  * **Sarathi-Serve**
    * Chunked-prefills: split a prefill request into *near equal-sized chunks*; create stall-free schedules that add new requests in a batch *without pausing ongoing decodes*.
    * Stall-free scheduling: improve throughput with large batch sizes; minimize the effect of batching on latency.
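The chunked-prefill idea above can be sketched as follows. This is a toy illustration, not Sarathi-Serve's actual scheduler: the token budget, the chunking rule, and the admission policy are simplifying assumptions.

```python
def make_chunks(prompt_len, chunk_size):
    """Split a prefill of `prompt_len` tokens into near equal-sized chunks."""
    n = -(-prompt_len // chunk_size)          # number of chunks (ceiling division)
    base, extra = divmod(prompt_len, n)
    return [base + (1 if i < extra else 0) for i in range(n)]

def build_batch(decodes, prefill_chunks, token_budget):
    """Admit all ongoing decodes first (one token each), then fill the
    remaining budget with prefill chunks, so decodes are never paused."""
    batch = [("decode", 1) for _ in decodes]
    budget = token_budget - len(batch)
    for chunk in prefill_chunks:
        if chunk <= budget:
            batch.append(("prefill", chunk))
            budget -= chunk
    return batch
```

Because decodes are admitted before any prefill work, a large prefill can never stall ongoing generation; it is simply spread across future batches.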
* ServerlessLLM: Low-Latency Serverless Inference for Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/fu)] \[[Code](https://github.com/ServerlessLLM/ServerlessLLM)]
  * Edinburgh
  * Multi-tier checkpoint loading.
  * Live migration of LLM inference: the source server migrates only the tokens; a re-computation of the KV-cache is triggered at the destination server.
  * Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
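The server-selection idea can be sketched as a comparison of estimated startup costs. The bandwidth numbers, field names, and the simple linear cost model below are illustrative assumptions, not ServerlessLLM's actual estimator.

```python
def load_time(model_size_gb, tier_bandwidth_gbps):
    """Time to load a checkpoint from a given storage tier."""
    return model_size_gb / tier_bandwidth_gbps

def migration_time(tokens, recompute_tps):
    """Live migration moves only the tokens; the KV cache is recomputed
    at the destination, so cost scales with token count."""
    return tokens / recompute_tps

def best_server(model_size_gb, servers):
    """servers: list of dicts with 'name', 'bandwidth_gbps' (best tier holding
    the checkpoint), and optional 'busy_tokens'/'recompute_tps' if an ongoing
    inference must first be migrated away."""
    def startup(s):
        t = load_time(model_size_gb, s["bandwidth_gbps"])
        if s.get("busy_tokens"):
            t += migration_time(s["busy_tokens"], s["recompute_tps"])
        return t
    return min(servers, key=startup)["name"]
```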
* InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
  * Seoul National University
  * **InfiniGen**: a *KV cache management* framework for *long-text generation*.
  * Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
  * Prefetch only the essential KV cache entries instead of fetching them all, mitigating the fetch overhead from host memory.
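The speculation step can be illustrated with a toy "partial rehearsal": score each cached token using only a low-rank slice of the query/key vectors, then prefetch just the top-k KV entries. The dimensions and top-k policy here are assumptions for the sketch, not InfiniGen's exact method.

```python
def partial_scores(query, keys, dims):
    """Approximate attention scores using only the first `dims` channels."""
    return [sum(query[d] * k[d] for d in range(dims)) for k in keys]

def speculate_prefetch(query, key_cache, dims, k):
    """Return indices of the k tokens with the highest approximate scores;
    only their KV entries would be fetched from host memory."""
    scores = partial_scores(query, key_cache, dims)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```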
* Llumnix: Dynamic Scheduling for Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
  * Alibaba
  * *Reschedule requests* to improve load-balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
  * Live migration for requests and the in-memory states (tokens).
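A minimal rebalancing sketch in the spirit of request rescheduling: when the load gap between instances exceeds a threshold, live-migrate the smallest request (its tokens and in-memory state) from the most to the least loaded instance. The threshold and selection rule are assumptions, not Llumnix's actual policy.

```python
def pick_migration(instances, threshold):
    """instances: {name: [request token counts]}.
    Returns (request_size, src, dst) or None if the cluster is balanced."""
    load = {name: sum(reqs) for name, reqs in instances.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    if load[src] - load[dst] <= threshold or not instances[src]:
        return None
    req = min(instances[src])              # cheapest in-memory state to move
    return (req, src, dst)
```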
* DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
  * PKU & UCSD
  * Disaggregate the prefill and decoding computation.
  * Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
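The core allocation question can be illustrated with a toy search over GPU splits: goodput is bottlenecked by the slower phase, so the split should balance the two. The per-GPU throughput model is an assumed simplification, not DistServe's actual placement algorithm.

```python
def goodput(prefill_gpus, decode_gpus, prefill_tput, decode_tput):
    """System goodput is limited by the slower of the two disaggregated phases."""
    if prefill_gpus == 0 or decode_gpus == 0:
        return 0.0
    return min(prefill_gpus * prefill_tput, decode_gpus * decode_tput)

def best_split(total_gpus, prefill_tput, decode_tput):
    """Exhaustively search the prefill/decode GPU split maximizing goodput."""
    return max(
        ((p, total_gpus - p) for p in range(total_gpus + 1)),
        key=lambda s: goodput(s[0], s[1], prefill_tput, decode_tput),
    )
```

For example, if a prefill GPU sustains 3x the request rate of a decode GPU, the search allocates most GPUs to decoding.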
* dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
  * PKU & Shanghai AI Lab
  * A credit-based batching algorithm to decide when to *merge and unmerge* LoRA adapters with the base model.
  * A request-adapter co-migration algorithm to decide when to *migrate* requests and adapters between different worker replicas.
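The merge/unmerge decision can be sketched as a credit scheme: "merged" mode (adapter fused into the base weights) is fast for single-adapter batches, while "unmerged" mode allows cross-adapter batching. Credits accrue when the current mode wastes work, and crossing a threshold triggers a switch. The credit rule and threshold below are assumptions for the sketch, not dLoRA's exact algorithm.

```python
class ModeSwitcher:
    def __init__(self, threshold):
        self.merged = True       # start with the adapter merged into the base
        self.credit = 0.0
        self.threshold = threshold

    def observe(self, batch_adapters):
        """Record one batch; return the mode to use after observing it."""
        distinct = len(set(batch_adapters))
        # Merged mode wastes work when a batch mixes adapters;
        # unmerged mode wastes work when batches are single-adapter.
        wasted = (distinct - 1) if self.merged else int(distinct == 1)
        self.credit += wasted
        if self.credit >= self.threshold:
            self.merged = not self.merged    # merge <-> unmerge switch
            self.credit = 0.0
        return self.merged
```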
* Parrot: Efficient Serving of LLM-based Applications with Semantic Variable \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
  * SJTU & MSRA
  * **Semantic Variable**: a unified abstraction to expose application-level knowledge to public LLM services.
    * Annotate an input/output variable in the prompt of a request.
    * Create the data pipeline when connecting multiple LLM requests.
    * Allow conventional data-flow analysis to be performed to uncover correlations across multiple LLM requests.
  * Implemented in Python.
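A toy model of the Semantic Variable idea: requests declare input/output variables, and wiring an output of one request into an input of another yields a data-flow graph the service can analyze (e.g., to find which requests may run in parallel). The classes and names below are invented for the sketch; Parrot's actual API differs.

```python
class SemVar:
    """A named placeholder connecting the output of one request to inputs of others."""
    def __init__(self, name):
        self.name, self.producer = name, None

class Request:
    def __init__(self, name, inputs, output):
        self.name, self.inputs, self.output = name, inputs, output
        output.producer = self           # record the data-flow edge

def dependencies(req):
    """Upstream requests, discovered by walking the input variables."""
    return {v.producer.name for v in req.inputs if v.producer}

# A small pipeline: two independent requests feeding a third.
a, b = SemVar("a"), SemVar("b")
r1 = Request("summarize", [], a)
r2 = Request("translate", [], b)
r3 = Request("combine", [a, b], SemVar("c"))
```

Here `r1` and `r2` have no dependencies and can be batched or run concurrently, while `r3` must wait for both.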
* Fairness in Serving Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
  * UC Berkeley
  * This is the *first* work to discuss the *fair serving* of LLMs.
  * Propose a fair-serving algorithm called **Virtual Token Counter** (**VTC**).
    * Track the service received by each client.
    * Prioritize the clients that have received the least service.
    * Only manipulate the dispatch order; never reject a request that fits in the batch.
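The counter idea can be sketched as follows. Measuring service in raw token counts and ignoring weights and prefill/decode pricing are simplifications of the paper's VTC.

```python
class VTCScheduler:
    def __init__(self):
        self.counter = {}                 # client -> tokens served so far

    def pick(self, pending):
        """pending: list of (client, tokens). Dispatch the request of the
        least-served client; requests are reordered, never rejected."""
        best = min(range(len(pending)),
                   key=lambda i: self.counter.get(pending[i][0], 0))
        client, tokens = pending[best]
        self.counter[client] = self.counter.get(client, 0) + tokens
        return best
```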

### Resource Allocation

* Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences \[[Paper](https://www.usenix.org/conference/osdi24/presentation/kumar)]
  * Meta Platforms
  * Main challenges for a resource-allocation framework:
    * Usability: how to translate real-life policies into precise mathematical formulas.
    * Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers.
  * **Rebalancer**: Meta's resource-allocation framework.
    * An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
    * A high-level specification language to lower the barrier for adoption by system practitioners (for usability).

### Job Scheduling

* When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling \[[Paper](https://www.usenix.org/conference/osdi24/presentation/bin-faisal)] \[[Code](https://github.com/TuftsNATLab/PCS)]
  * Tufts
  * PCS: Predictability-Centric Scheduling
  * Use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights).
  * Use a simulation-aided search strategy to discover WFQ configurations.
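The WFQ intuition behind the predictability claim can be sketched as follows: each queue is guaranteed a share of GPU time proportional to its weight, so a job's completion time follows from its queue's share. The round-based service model and the estimate formula are simplifications for the sketch.

```python
def wfq_service(weights, total_time):
    """Split `total_time` of GPU time across queues in proportion to weight."""
    total_weight = sum(weights.values())
    return {q: total_time * w / total_weight for q, w in weights.items()}

def completion_estimate(job_demand, queue_weight, weights, rate=1.0):
    """Predicted finish time of a job given its queue's guaranteed share
    (`rate` is the full-cluster processing rate)."""
    share = queue_weight / sum(weights.values())
    return job_demand / (rate * share)
```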
* MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale \[[Paper](https://www.usenix.org/conference/osdi24/presentation/choudhury)]
  * Meta Platforms
  * MAST: ML Application Scheduler on Twine
  * Provide a global-scheduling abstraction to all ML training workloads.
  * Three design principles: temporal decoupling, scope decoupling, and exhaustive search.

### Auto Parallelization

* nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi)] \[[Code](https://github.com/microsoft/nnscaler)]
  * USTC & MSRA & xAI & BaseBit Technologies
  * Empower domain experts to construct their own search space through three primitives, `op-trans`, `op-assign`, and `op-order`.
  * Allow the application of constraints to those primitives during space construction.
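A toy rendering of the three primitives as plan-construction steps: `op-trans` partitions an operator, `op-assign` maps the partitions to devices, and `op-order` sequences them. The plan representation below is invented for illustration; nnScaler's actual interfaces differ.

```python
def op_trans(op, parts):
    """Partition an operator into `parts` sub-operators."""
    return [f"{op}[{i}/{parts}]" for i in range(parts)]

def op_assign(sub_ops, devices):
    """Map each sub-operator to a device (round-robin here)."""
    return {s: devices[i % len(devices)] for i, s in enumerate(sub_ops)}

def op_order(assignment, order):
    """Fix the execution order of the assigned sub-operators."""
    return [(s, assignment[s]) for s in order]
```

Constraints would then prune this space, e.g., by restricting which `op-trans` partitionings or `op-assign` mappings are admissible.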

### Machine Learning Inference

* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)] \[[Code](https://github.com/ss7krd/Usher)]
  * UVA & GaTech
  * Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).
    * GPU kernel-based model resource requirement estimator.
    * Heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.
    * Operator graph merger that merges multiple models to minimize interference in the GPU cache.
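A simple interference-aware placement heuristic in the spirit of the scheduler above: pack each model onto the fullest GPU that still stays under a utilization cap, so utilization rises without co-locating more work than a GPU can absorb. The demand estimates and the cap are illustrative assumptions, not Usher's actual estimator or policy.

```python
def place(models, gpus, cap=1.0):
    """models: {name: estimated utilization demand}; gpus: list of GPU names.
    Returns a placement dict, or None if the cap would be violated."""
    load = {g: 0.0 for g in gpus}
    placement = {}
    for m, demand in sorted(models.items(), key=lambda kv: -kv[1]):
        feasible = [g for g in gpus if load[g] + demand <= cap]
        if not feasible:
            return None                          # would exceed the interference cap
        g = max(feasible, key=lambda g: load[g])  # best fit: fullest feasible GPU
        load[g] += demand
        placement[m] = g
    return placement
```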

### Tensor Program Generation

* Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhai)] \[[Code](https://github.com/zhaiyi000/tlm)]
  * USTC & Huawei & ByteDance & Hunan University
  * Tensor Language Model (TLM)
* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24_ladder_artifact)]
  * MSRA
* MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhuang)] \[[Code](https://github.com/AlibabaResearch/mononn)]
  * Sydney & Alibaba
  * The code is currently not available.

### Machine Learning APIs

* ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications \[[Paper](https://www.usenix.org/conference/osdi24/presentation/liu)] \[[Code](https://github.com/UChi-JCL/chameleonAPI)]
  * UChicago & ECNU & MSR

### In-Network Machine Learning

* Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhang-qizheng)] \[[Code](https://github.com/Per-Packet-AI/Caravan-Artifact-OSDI24)]
  * Stanford & Princeton & Sapienza University of Rome & UMich

### Microkernel

* Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel \[[Paper](https://www.usenix.org/conference/osdi24/presentation/chen-haibo)]
  * Huawei Central Software Institute & SJTU
  * HongMeng kernel (HM)

### Compute Express Link (CXL)

* Managing Memory Tiers with CXL in Virtualized Environments \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yuhong)]
  * Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich

### Distributed Snapshots

* Beaver: Practical Partial Snapshots for Distributed Cloud Services \[[Paper](https://www.usenix.org/conference/osdi24/presentation/yu)] \[[Code](https://github.com/eniac/Beaver)]
  * UPenn & SJTU & Princeton & Microsoft & UW

### Network Interface Card (NIC)

* High-throughput and Flexible Host Networking for Accelerated Computing \[[Paper](https://www.usenix.org/conference/osdi24/presentation/skiadopoulos)] \[[Code](https://github.com/enfabrica/iperf)]
  * Stanford & Cornell & Enfabrica

### Collective Communication Library

* ACCL+: an FPGA-Based Collective Engine for Distributed Applications \[[Paper](https://www.usenix.org/conference/osdi24/presentation/he)]
  * ETH & Amsterdam & AMD

### Hardware Accelerators

* Performance Interfaces for Hardware Accelerators \[[Paper](https://www.usenix.org/conference/osdi24/presentation/ma-jiacheng)] \[[Code](https://github.com/dslab-epfl/lpn)]
  * EPFL
  * LPN: Latency Petri Net

### Cloud Block Storage

* Burstable Cloud Block Storage with Data Processing Units \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shu)]
  * PKU & Alibaba Cloud

### Formal Verification

* Anvil: Verifying Liveness of Cluster Management Controllers \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-xudong)] \[[Code](https://github.com/vmware-research/verifiable-controllers)]
  * UIUC & UW-Madison & VMware Research & Feldera
  * **Best Paper Award**

## References

* Notes from SJTU IPADS (in Chinese)
  * [OSDI 2024 Paper Reviews, Day 1 Session 1: Memory Management (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707983034)
  * [OSDI 2024 Paper Reviews, Day 1 Session 2: Low-Latency LLM Serving (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707990822)
  * [OSDI 2024 Paper Reviews, Day 1 Session 3: Distributed Systems (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707998884)
  * [OSDI 2024 Paper Reviews, Day 2 Session 4: Deep Learning (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708002201)
  * [OSDI 2024 Paper Reviews, Day 2 Session 5: Operating Systems (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708003676)
  * [OSDI 2024 Paper Reviews, Day 2 Session 6: Cloud Computing (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708034284)
  * [OSDI 2024 Paper Reviews, Day 2 Session 7: Formal Verification (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708035509)
  * [OSDI 2024 Paper Reviews, Day 3 Session 8: Cloud Security (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708036283)
  * [OSDI 2024 Paper Reviews, Day 3 Session 9: Data Management (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708037149)
  * [OSDI 2024 Paper Reviews, Day 3 Session 10: Analysis of Correctness (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708037498)
  * [OSDI 2024 Paper Reviews, Day 3 Session 11: ML Scheduling (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708038262)
