# OSDI 2024

## Meta Info

Homepage: <https://www.usenix.org/conference/osdi24>

Paper list: <https://www.usenix.org/conference/osdi24/technical-sessions>

### Acceptance Rate

19.2% (= 53 / 276)

## Papers

### Large Language Models (LLMs)

* Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)]
  * MSR India & GaTech
  * **Sarathi-Serve**
    * Chunked-prefills: split a prefill request into *near equal-sized chunks*; create stall-free schedules that add new requests in a batch *without pausing ongoing decodes*.
    * Stall-free scheduling: improve throughput with large batch sizes; minimize the effect of batching on latency.
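
  A minimal sketch of the chunked-prefill idea, assuming an illustrative `TOKEN_BUDGET` and request fields (not the authors' implementation): each iteration packs every ongoing decode first, then fills the leftover token budget with prefill chunks, so decodes are never paused.

  ```python
  from dataclasses import dataclass

  TOKEN_BUDGET = 512  # illustrative per-iteration token budget

  @dataclass
  class Request:
      rid: int
      prompt_len: int
      prefill_done: int = 0          # prompt tokens already processed

      @property
      def decoding(self) -> bool:    # prefill finished -> decode phase
          return self.prefill_done >= self.prompt_len

  def schedule_iteration(requests):
      """Build one stall-free batch: all ongoing decodes first, then
      fill the leftover token budget with prefill chunks."""
      batch, budget = [], TOKEN_BUDGET
      for r in requests:                      # decodes cost 1 token each
          if r.decoding and budget > 0:
              batch.append((r.rid, "decode", 1))
              budget -= 1
      for r in requests:                      # chunk prefills into the rest
          if not r.decoding and budget > 0:
              chunk = min(budget, r.prompt_len - r.prefill_done)
              r.prefill_done += chunk
              batch.append((r.rid, "prefill", chunk))
              budget -= chunk
      return batch

  reqs = [Request(0, 4000), Request(1, 10, 10), Request(2, 10, 10)]
  print(schedule_iteration(reqs))
  # [(1, 'decode', 1), (2, 'decode', 1), (0, 'prefill', 510)]
  ```
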
* ServerlessLLM: Low-Latency Serverless Inference for Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/fu)] \[[Code](https://github.com/ServerlessLLM/ServerlessLLM)]
  * Edinburgh
  * Multi-tier checkpoint loading.
  * Live migration of LLM inference: the source server migrates only the tokens; the KV cache is recomputed at the destination server.
  * Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
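
  A minimal sketch of the server-selection idea, assuming hypothetical cost estimates: for each candidate server, add the estimated time to load the checkpoint from its fastest available storage tier to the cost of migrating away any inference running there, and pick the server with the lowest estimated startup latency.

  ```python
  def load_time(ckpt_size_gb: float, tier_bw_gbps: float) -> float:
      """Estimated time to load a checkpoint from one storage tier."""
      return ckpt_size_gb / tier_bw_gbps

  def best_server(servers, ckpt_size_gb: float):
      """Pick the server minimizing estimated model startup latency.

      `servers` maps name -> dict with the bandwidth of the fastest
      tier holding the checkpoint ('bw_gbps') and the cost of migrating
      the inference currently running there ('migrate_s', 0 if idle).
      All numbers are illustrative.
      """
      def startup(s):
          return load_time(ckpt_size_gb, s["bw_gbps"]) + s["migrate_s"]
      return min(servers.items(), key=lambda kv: startup(kv[1]))

  servers = {
      "gpu-1": {"bw_gbps": 2.0, "migrate_s": 0.0},   # checkpoint on SSD, idle
      "gpu-2": {"bw_gbps": 24.0, "migrate_s": 1.5},  # checkpoint in DRAM, busy
  }
  print(best_server(servers, ckpt_size_gb=26.0))  # gpu-2: 26/24 + 1.5 < 13
  ```
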
* InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
  * Seoul National University
  * **InfiniGen**: a *KV cache management* framework for *long-text generation*.
  * Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
  * Prefetch only the essential KV cache entries instead of fetching them all, mitigating the fetch overhead from host memory (see the sketch below).
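
  A minimal sketch of the speculative prefetch idea with toy tensors (the shapes and the top-k rule are illustrative): approximate the next layer's attention scores using a slice of its query weight against its cached keys, then fetch only the KV entries of the top-scoring tokens from host memory.

  ```python
  import numpy as np

  def speculate_important_tokens(x, wq_partial, key_cache, k=4):
      """Rehearse attention for the next layer with partial weights.

      x:          current-layer input, shape (d,)
      wq_partial: a slice of the next layer's query weight, (d, d_part)
      key_cache:  cached keys for the next layer, (n_tokens, d_part)
      Returns indices of the k tokens whose KV entries to prefetch.
      """
      q = x @ wq_partial                      # cheap approximate query
      scores = key_cache @ q                  # approximate attention logits
      return np.argsort(scores)[-k:]          # top-k important tokens

  rng = np.random.default_rng(0)
  d, d_part, n_tokens = 64, 8, 100
  idx = speculate_important_tokens(
      rng.normal(size=d), rng.normal(size=(d, d_part)),
      rng.normal(size=(n_tokens, d_part)))
  print("prefetch KV entries:", idx)  # fetch only these from host memory
  ```
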
* Llumnix: Dynamic Scheduling for Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
  * Alibaba
  * *Reschedule requests* to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
  * Live migration of requests and their in-memory states (tokens).
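
  A minimal sketch of load-aware rescheduling, assuming a simple queue-length load metric and an illustrative threshold: move a request (and, via live migration, its token state) from the most loaded to the least loaded replica whenever the imbalance grows too large.

  ```python
  def rebalance(replicas: dict, threshold: int = 4):
      """replicas maps name -> list of request ids (the in-flight queue).

      Migrate one request from the most to the least loaded replica
      whenever the gap exceeds `threshold`; its in-memory token state
      would move with it via live migration.
      """
      src = max(replicas, key=lambda r: len(replicas[r]))
      dst = min(replicas, key=lambda r: len(replicas[r]))
      if len(replicas[src]) - len(replicas[dst]) > threshold:
          req = replicas[src].pop()          # pick a victim request
          replicas[dst].append(req)          # tokens migrate alongside
          return req, src, dst
      return None

  replicas = {"a": list(range(9)), "b": [100, 101]}
  print(rebalance(replicas))  # (8, 'a', 'b')
  ```
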
* DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
  * PKU & UCSD
  * Disaggregate the prefill and decoding computation.
  * Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
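
  A minimal sketch of per-phase planning under stated assumptions: the profiling numbers below are made up, and instead of the paper's goodput-and-bandwidth-aware planner this simply enumerates parallelism configurations separately for the prefill and decoding instances and keeps the cheapest pair whose estimated latencies meet the per-phase SLOs.

  ```python
  # (config name, #GPUs, estimated latency in ms) -- illustrative numbers
  PREFILL = [("tp2", 2, 180), ("tp4", 4, 95), ("tp8", 8, 60)]
  DECODE  = [("tp1", 1, 55),  ("tp2", 2, 30), ("tp4", 4, 18)]

  def plan(ttft_slo_ms: float, tpot_slo_ms: float):
      """Pick the cheapest (prefill, decode) pair meeting both SLOs."""
      feasible = [
          (p, d) for p in PREFILL for d in DECODE
          if p[2] <= ttft_slo_ms and d[2] <= tpot_slo_ms
      ]
      return min(feasible, key=lambda pd: pd[0][1] + pd[1][1], default=None)

  print(plan(ttft_slo_ms=100, tpot_slo_ms=40))
  # (('tp4', 4, 95), ('tp2', 2, 30)): 6 GPUs total
  ```
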
* dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
  * PKU & Shanghai AI Lab
  * A credit-based batching algorithm to decide when to *merge and unmerge* LoRA adapters with the base model.
  * A request-adapter co-migration algorithm to decide when to *migrate* requests and adapters between worker replicas.
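
  A minimal sketch of a credit-based merge/unmerge decision (the credit rule and threshold are illustrative, not dLoRA's algorithm): a batch for a single adapter favors running it merged into the base weights, a mixed batch favors unmerged batched LoRA computation, and credits must accumulate before switching so the system does not thrash.

  ```python
  class MergePolicy:
      """Credit-based toggle between merged and unmerged LoRA serving."""

      def __init__(self, switch_cost: int = 3):
          self.merged = False       # is an adapter merged into the base?
          self.credits = 0          # evidence accumulated toward switching
          self.switch_cost = switch_cost

      def observe_batch(self, adapter_ids: list) -> bool:
          """Return True if we flip the merge state for this batch."""
          prefer_merged = len(set(adapter_ids)) == 1  # one adapter: merge pays
          if prefer_merged != self.merged:
              self.credits += 1             # earn credit toward a switch
          else:
              self.credits = 0
          if self.credits >= self.switch_cost:  # enough evidence: switch
              self.merged, self.credits = prefer_merged, 0
              return True
          return False

  policy = MergePolicy()
  for batch in [["a"], ["a"], ["a"], ["a", "b"]]:
      print(policy.observe_batch(batch), policy.merged)
  # False False / False False / True True (merge) / False True
  ```
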
* Parrot: Efficient Serving of LLM-based Applications with Semantic Variable \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
  * SJTU & MSRA
  * **Semantic Variable**: a unified abstraction to expose application-level knowledge to public LLM services.
    * Annotate an input/output variable in the prompt of a request.
    * Create the data pipeline when connecting multiple LLM requests.
    * Allow conventional data-flow analysis to uncover correlations across multiple LLM requests.
  * Implemented in Python.
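
  A minimal sketch of the Semantic Variable idea with a hypothetical API (not Parrot's actual interface): placeholders mark a request's inputs and outputs, and reusing an output variable as a later input makes the inter-request data flow explicit, so a scheduler can recover the dependency DAG.

  ```python
  class SemanticVariable:
      """A named placeholder connecting the output of one LLM request
      to the input of another."""
      def __init__(self, name):
          self.name, self.producer = name, None

  def llm_request(template: str, inputs: dict, output: SemanticVariable):
      """Register a request; record which request produces `output`."""
      deps = [v for v in inputs.values() if v.producer is not None]
      req = {"template": template, "inputs": inputs,
             "output": output, "deps": deps}
      output.producer = req
      return req

  article = SemanticVariable("article")       # bound by the application
  summary = SemanticVariable("summary")
  verdict = SemanticVariable("verdict")

  r1 = llm_request("Summarize: {article}", {"article": article}, summary)
  r2 = llm_request("Rate this summary: {summary}", {"summary": summary}, verdict)
  print(len(r2["deps"]))  # 1 -- r2 consumes r1's output, so they pipeline
  ```
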
* Fairness in Serving Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
  * UC Berkeley
  * This is the *first* work to discuss the *fair serving* of LLMs.
  * Propose a fair-serving algorithm called **Virtual Token Counter** (**VTC**).
    * Track the service received by each client.
    * Prioritize the client that has received the least service.
    * Only manipulate the dispatch order; never reject a request that fits in the batch.
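
  A minimal sketch of a virtual token counter, assuming each request is charged a known token count: every client accumulates the tokens it has been served, and the dispatcher always admits a request from the backlogged client with the smallest counter, changing only the order and never rejecting anything.

  ```python
  from collections import defaultdict, deque

  class VirtualTokenCounter:
      """Fair dispatch: serve the client that has received the least."""

      def __init__(self):
          self.served = defaultdict(int)          # client -> tokens served
          self.queues = defaultdict(deque)        # client -> pending requests

      def submit(self, client: str, request: dict):
          self.queues[client].append(request)

      def dispatch(self):
          """Pick the backlogged client with the smallest counter."""
          backlogged = [c for c, q in self.queues.items() if q]
          client = min(backlogged, key=lambda c: self.served[c])
          request = self.queues[client].popleft()
          self.served[client] += request["tokens"]  # charge actual usage
          return client, request

  vtc = VirtualTokenCounter()
  vtc.submit("alice", {"tokens": 100})
  vtc.submit("alice", {"tokens": 100})
  vtc.submit("bob", {"tokens": 10})
  print([vtc.dispatch()[0] for _ in range(3)])  # ['alice', 'bob', 'alice']
  ```
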

### Resource Allocation

* Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences \[[Paper](https://www.usenix.org/conference/osdi24/presentation/kumar)]
  * Meta Platforms
  * Main challenges for a resource-allocation framework.
    * Usability: how to translate real-life policies into precise mathematical formulas.
    * Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers.
  * **Rebalancer**: Meta's resource-allocation framework.
    * An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
    * A high-level specification language to lower the barrier for adoption by system practitioners (for usability).

### Job Scheduling

* When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling \[[Paper](https://www.usenix.org/conference/osdi24/presentation/bin-faisal)] \[[Code](https://github.com/TuftsNATLab/PCS)]
  * Tufts
  * PCS: Predictability-Centric Scheduling
  * Use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights).
  * Use a simulation-aided search strategy to discover WFQ configurations.
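
  A minimal sketch of simulation-aided WFQ configuration search, with a toy simulator and made-up job sizes, weights, and capacity: for each candidate weight assignment, simulate the job mix under weighted fair sharing and keep the configuration with the best predicted completion times (here, minimum total JCT).

  ```python
  import itertools

  def simulate_wfq(job_sizes, weights, capacity=10.0, dt=0.1):
      """Return per-class completion times under weighted fair sharing."""
      remaining = dict(job_sizes)                # class -> remaining work
      done, t = {}, 0.0
      while remaining:
          active = list(remaining)
          total_w = sum(weights[c] for c in active)
          for c in active:                       # share capacity by weight
              remaining[c] -= capacity * weights[c] / total_w * dt
              if remaining[c] <= 0:
                  done[c] = t + dt
                  del remaining[c]
          t += dt
      return done

  jobs = {"small": 20.0, "large": 80.0}
  best = min(
      (dict(zip(jobs, w)) for w in itertools.product([1, 2, 4], repeat=2)),
      key=lambda w: sum(simulate_wfq(jobs, w).values()),  # total JCT proxy
  )
  print(best, simulate_wfq(jobs, best))
  ```
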
* MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale \[[Paper](https://www.usenix.org/conference/osdi24/presentation/choudhury)]
  * Meta Platforms
  * MAST: ML Application Scheduler on Twine
  * Provide a global-scheduling abstraction to all ML training workloads.
  * Three design principles: temporal decoupling, scope decoupling, and exhaustive search.

### Auto Parallelization

* nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi)] \[[Code](https://github.com/microsoft/nnscaler)]
  * USTC & MSRA & xAI & BaseBit Technologies
  * Empower domain experts to construct their own search space through three primitives, `op-trans`, `op-assign`, and `op-order`.
  * Allow the application of constraints to those primitives during space construction.
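
  A minimal sketch of constraint-guided space construction, assuming hypothetical primitive signatures (the real `op-trans`, `op-assign`, and `op-order` operate on dataflow-graph operators; `op-order` is omitted here for brevity): candidate partitionings and device assignments are enumerated per operator, and expert-supplied constraints prune the space before any search runs.

  ```python
  def op_trans(op: str) -> list:
      """Candidate ways to partition one operator (illustrative)."""
      return [(op, algo) for algo in ("row-split", "col-split", "replicate")]

  def op_assign(partition, devices=(0, 1)) -> list:
      """Candidate device assignments for a partitioned operator."""
      return [(partition, d) for d in devices]

  def build_space(ops, constraint):
      """Cross-product of transformations and assignments, pruned by a
      user-supplied constraint during space construction."""
      space = []
      for op in ops:
          for part in op_trans(op):
              for plan in op_assign(part):
                  if constraint(plan):
                      space.append(plan)
      return space

  def no_replicated_matmul(plan):
      """A domain-expert rule: never replicate matmuls."""
      (op, algo), _device = plan
      return not (op == "matmul" and algo == "replicate")

  print(len(build_space(["matmul", "softmax"], no_replicated_matmul)))  # 10
  ```
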

### Machine Learning Inference

* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)] \[[Code](https://github.com/ss7krd/Usher)]
  * UVA & GaTech
  * Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).
    * A GPU-kernel-based estimator of each model's resource requirements.
    * A heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.
    * An operator-graph merger that merges multiple models to minimize interference in the GPU cache.
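
  A minimal sketch of interference-aware placement, assuming hypothetical per-model GPU-utilization estimates and a utilization cap: greedily place each model on the fullest GPU that still fits under the cap, packing models tightly (spatial multiplexing) while keeping headroom elsewhere.

  ```python
  def place(models: dict, n_gpus: int = 2, cap: float = 1.0):
      """models: name -> estimated GPU-utilization fraction (illustrative).

      Best-fit-decreasing packing: co-locate models as tightly as the
      interference cap allows, leaving whole GPUs free for more models.
      """
      load = [0.0] * n_gpus
      placement = {}
      for name, util in sorted(models.items(), key=lambda kv: -kv[1]):
          fits = [g for g in range(n_gpus) if load[g] + util <= cap]
          if not fits:
              raise RuntimeError(f"no GPU can host {name} under the cap")
          g = max(fits, key=lambda g: load[g])   # fullest GPU that still fits
          load[g] += util
          placement[name] = g
      return placement, load

  print(place({"resnet": 0.35, "bert": 0.55, "yolo": 0.4}))
  # ({'bert': 0, 'yolo': 0, 'resnet': 1}, [0.95, 0.35])
  ```
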

### Tensor Program Generation

* Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhai)] \[[Code](https://github.com/zhaiyi000/tlm)]
  * USTC & Huawei & ByteDance & Hunan University
  * Tensor Language Model (TLM)
* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24_ladder_artifact)]
  * MSRA
* MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhuang)] \[[Code](https://github.com/AlibabaResearch/mononn)]
  * Sydney & Alibaba
  * The code is currently not available.

### Machine Learning APIs

* ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications \[[Paper](https://www.usenix.org/conference/osdi24/presentation/liu)] \[[Code](https://github.com/UChi-JCL/chameleonAPI)]
  * UChicago & ECNU & MSR

### In-Network Machine Learning

* Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhang-qizheng)] \[[Code](https://github.com/Per-Packet-AI/Caravan-Artifact-OSDI24)]
  * Stanford & Princeton & Sapienza University of Rome & UMich

### Microkernel

* Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel \[[Paper](https://www.usenix.org/conference/osdi24/presentation/chen-haibo)]
  * Huawei Central Software Institute & SJTU
  * Hong-Meng kernel (HM)

### Compute Express Link (CXL)

* Managing Memory Tiers with CXL in Virtualized Environments \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yuhong)]
  * Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich

### Distributed Snapshots

* Beaver: Practical Partial Snapshots for Distributed Cloud Services \[[Paper](https://www.usenix.org/conference/osdi24/presentation/yu)] \[[Code](https://github.com/eniac/Beaver)]
  * UPenn & SJTU & Princeton & Microsoft & UW

### Network Interface Card (NIC)

* High-throughput and Flexible Host Networking for Accelerated Computing \[[Paper](https://www.usenix.org/conference/osdi24/presentation/skiadopoulos)] \[[Code](https://github.com/enfabrica/iperf)]
  * Stanford & Cornell & Enfabrica

### Collective Communication Library

* ACCL+: an FPGA-Based Collective Engine for Distributed Applications \[[Paper](https://www.usenix.org/conference/osdi24/presentation/he)]
  * ETH & Amsterdam & AMD

### Hardware Accelerators

* Performance Interfaces for Hardware Accelerators \[[Paper](https://www.usenix.org/conference/osdi24/presentation/ma-jiacheng)] \[[Code](https://github.com/dslab-epfl/lpn)]
  * EPFL
  * LPN: Latency Petri Net

### Cloud Block Storage

* Burstable Cloud Block Storage with Data Processing Units \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shu)]
  * PKU & Alibaba Cloud

### Formal Verification

* Anvil: Verifying Liveness of Cluster Management Controllers \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-xudong)] \[[Code](https://github.com/vmware-research/verifiable-controllers)]
  * UIUC & UW-Madison & VMware Research & Feldera
  * **Best Paper Award**

## References

* Notes from SJTU IPADS (in Chinese)
  * [OSDI 2024 Paper Reviews, Day 1 Session 1: Memory Management (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707983034)
  * [OSDI 2024 Paper Reviews, Day 1 Session 2: Low-Latency LLM Serving (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707990822)
  * [OSDI 2024 Paper Reviews, Day 1 Session 3: Distributed Systems (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/707998884)
  * [OSDI 2024 Paper Reviews, Day 2 Session 4: Deep Learning (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708002201)
  * [OSDI 2024 Paper Reviews, Day 2 Session 5: Operating Systems (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708003676)
  * [OSDI 2024 Paper Reviews, Day 2 Session 6: Cloud Computing (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708034284)
  * [OSDI 2024 Paper Reviews, Day 2 Session 7: Formal Verification (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708035509)
  * [OSDI 2024 Paper Reviews, Day 3 Session 8: Cloud Security (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708036283)
  * [OSDI 2024 Paper Reviews, Day 3 Session 9: Data Management (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708037149)
  * [OSDI 2024 Paper Reviews, Day 3 Session 10: Analysis of Correctness (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708037498)
  * [OSDI 2024 Paper Reviews, Day 3 Session 11: ML Scheduling (by IPADS-SYS on Zhihu)](https://zhuanlan.zhihu.com/p/708038262)

