OSDI 2024

Meta Info

Homepage: https://www.usenix.org/conference/osdi24

Paper list: https://www.usenix.org/conference/osdi24/technical-sessions

Acceptance Rate

19.2% (= 53 / 276)

Papers

Large Language Models (LLMs)

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [Paper] [Code] (sketch below)

    • MSR India & GaTech

    • Sarathi-Serve

      • Chunked-prefills: split a prefill request into near-equal-sized chunks so that prefill work can be interleaved across scheduler iterations.

      • Stall-free scheduling: use the per-iteration slack to add new requests to a running batch without pausing ongoing decodes, gaining the throughput of large batch sizes while minimizing the latency impact of batching.
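
A minimal sketch of the chunked-prefill and stall-free scheduling idea described above, assuming a hypothetical scheduler loop with a fixed per-iteration token budget; `TOKEN_BUDGET`, `Request`, and `build_batch` are illustrative names, not Sarathi-Serve's API.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 512  # illustrative per-iteration token budget, not Sarathi-Serve's value

@dataclass
class Request:
    rid: str
    prompt_len: int         # total prefill (prompt) tokens
    prefill_done: int = 0   # prefill tokens processed so far
    decoding: bool = False  # whether the request has entered the decode phase

def build_batch(active):
    """One scheduler iteration: admit every ongoing decode (one token each),
    then spend the leftover budget on prefill chunks of new requests, so a long
    prompt never stalls running decodes."""
    budget, batch = TOKEN_BUDGET, []
    for r in active:                       # decodes first: they are never paused
        if r.decoding:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    for r in active:                       # fill the rest with prefill chunks
        if r.decoding or budget <= 0:
            continue
        chunk = min(budget, r.prompt_len - r.prefill_done)
        batch.append((r.rid, "prefill", chunk))
        r.prefill_done += chunk
        budget -= chunk
        if r.prefill_done == r.prompt_len:
            r.decoding = True              # joins the decodes next iteration
    return batch

# A 1200-token prompt is split into ~512-token chunks across iterations while an
# existing decode keeps producing one token per iteration.
active = [Request("dec-1", prompt_len=8, prefill_done=8, decoding=True),
          Request("new-1", prompt_len=1200)]
for step in range(3):
    print(step, build_batch(active))
```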

  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [Paper] [Code] (sketch below)

    • Edinburgh

    • Multi-tier checkpoint loading.

    • Live migration of LLM inference: the source server migrates only the tokens; a re-computation of the KV-cache is triggered at the destination server.

    • Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
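
A toy version of the server-selection cost model described above, assuming made-up checkpoint sizes, tier bandwidths, and migration rates; the functions and numbers are illustrative, not ServerlessLLM's actual estimator.

```python
# Pick the server that minimizes estimated model startup latency, trading off
# checkpoint-loading time against the cost of migrating an ongoing inference away.

CKPT_SIZE_GB = 26.0  # e.g., a 13B model in fp16 (illustrative)

def load_time(ckpt_gb, tier_bandwidth_gbps):
    """Time to load a checkpoint from a given storage tier."""
    return ckpt_gb / tier_bandwidth_gbps

def migration_time(tokens_so_far, token_resend_per_s, kv_recompute_per_s):
    """Live-migration cost: only tokens are shipped, and the KV cache is
    re-computed at the destination."""
    return tokens_so_far / token_resend_per_s + tokens_so_far / kv_recompute_per_s

servers = {
    # fastest tier that already holds the checkpoint, in GB/s (illustrative)
    "gpu-a": {"tier_bw": 1.5,  "busy": False},   # checkpoint on local SSD
    "gpu-b": {"tier_bw": 20.0, "busy": True},    # checkpoint in host DRAM, but occupied
    "gpu-c": {"tier_bw": 0.3,  "busy": False},   # must pull from remote storage
}

def startup_latency(name, s, running_tokens=800):
    t = load_time(CKPT_SIZE_GB, s["tier_bw"])
    if s["busy"]:
        # Must first live-migrate the ongoing inference off this server.
        t += migration_time(running_tokens, token_resend_per_s=50_000,
                            kv_recompute_per_s=4_000)
    return t

best = min(servers, key=lambda n: startup_latency(n, servers[n]))
print({n: round(startup_latency(n, s), 2) for n, s in servers.items()}, "->", best)
```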

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Paper] (sketch below)

    • Seoul National University

    • InfiniGen: a KV cache management framework for long-text generation.

    • Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.

      • Prefetch only the essential KV cache entries instead of fetching them all, mitigating the fetch overhead from host memory.
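
A rough numpy sketch of the speculation ("rehearsal") step: approximate the next layer's attention scores from the current layer's input and a slice of the next layer's query weights and key cache, then prefetch only the top-k entries. The dimension slice and shapes are invented for illustration; InfiniGen selects the informative dimensions more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, n_dims, topk = 1024, 128, 16, 64

x_cur       = rng.standard_normal(d)             # current layer's input for the new token
Wq_next     = rng.standard_normal((d, d))        # next layer's query projection
K_next_full = rng.standard_normal((seq_len, d))  # next layer's key cache (resident in host memory)

def rehearse(x, Wq, K, dims, n_keep):
    """Speculate which KV entries matter using only part of the query weight
    and key cache (the 'minimal rehearsal')."""
    q_part = x @ Wq[:, dims]           # partial query
    scores = K[:, dims] @ q_part       # approximate attention logits
    return np.argsort(scores)[-n_keep:]

# Illustrative slice: the first n_dims columns; the real system picks skewed dimensions.
prefetch_idx = rehearse(x_cur, Wq_next, K_next_full, np.arange(n_dims), topk)

# Exact attention over the full cache, to compare against the speculation.
exact_top = np.argsort(K_next_full @ (x_cur @ Wq_next))[-topk:]
overlap = len(set(prefetch_idx) & set(exact_top)) / topk
print(f"prefetching {topk}/{seq_len} entries; overlap with exact top-{topk}: {overlap:.0%}")
```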

  • Llumnix: Dynamic Scheduling for Large Language Model Serving [Paper] [Code] (sketch below)

    • Alibaba

    • Reschedule requests to improve load-balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.

    • Live migration of requests and their in-memory state (tokens) across instances.
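
A minimal sketch of the rescheduling idea, assuming an invented load metric (KV-cache occupancy) and threshold; Llumnix's actual policy also accounts for priorities, SLOs, and fragmentation, which this ignores.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    requests: list           # (request_id, tokens held in the KV cache)
    capacity_tokens: int

    @property
    def load(self):          # fraction of KV-cache capacity in use
        return sum(t for _, t in self.requests) / self.capacity_tokens

def rebalance(instances, threshold=0.2):
    """If the load gap between the hottest and coldest instance exceeds the
    threshold, live-migrate one request (its in-memory tokens) across."""
    src = max(instances, key=lambda i: i.load)
    dst = min(instances, key=lambda i: i.load)
    if src.load - dst.load < threshold or not src.requests:
        return None
    # Prefer migrating the request with the smallest state to keep migration cheap.
    req = min(src.requests, key=lambda r: r[1])
    src.requests.remove(req)
    dst.requests.append(req)
    return req[0], src.name, dst.name

cluster = [
    Instance("i0", [("a", 6000), ("b", 1500)], capacity_tokens=8000),
    Instance("i1", [("c", 1000)], capacity_tokens=8000),
]
print(rebalance(cluster))   # ('b', 'i0', 'i1')
```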

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Paper] [Code] (sketch below)

    • PKU & UCSD

    • Disaggregate the prefill and decoding computation.

    • Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
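
A toy sketch of the disaggregation: a prefill pool and a decode pool run as separate workers connected by a queue that stands in for the KV-cache transfer. It ignores the per-phase parallelism and bandwidth-aware placement that DistServe actually co-optimizes.

```python
import queue
import threading

prefill_q, decode_q = queue.Queue(), queue.Queue()
results = {}

def prefill_worker():
    """Prefill instance: process the whole prompt once, then ship the KV cache
    (here just a placeholder string) to the decode pool."""
    while True:
        req = prefill_q.get()
        if req is None:
            decode_q.put(None)      # propagate shutdown to the decode pool
            return
        req["kv_cache"] = f"kv({req['id']}, {req['prompt_len']} tokens)"
        decode_q.put(req)

def decode_worker():
    """Decode instance: generate tokens autoregressively from the received KV cache."""
    while True:
        req = decode_q.get()
        if req is None:
            return
        results[req["id"]] = [f"tok{i}" for i in range(req["max_new_tokens"])]

threads = [threading.Thread(target=prefill_worker), threading.Thread(target=decode_worker)]
for t in threads:
    t.start()
for i in range(3):
    prefill_q.put({"id": i, "prompt_len": 512, "max_new_tokens": 4})
prefill_q.put(None)
for t in threads:
    t.join()
print(results)
```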

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving [Paper] (sketch below)

    • PKU & Shanghai AI Lab

    • A credit-based batching algorithm to decide when to merge and unmerge LoRA adapters with the base model.

    • A request-adapter co-migration algorithm to decide when to migrate between different worker replicas.
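
The exact credit rule in dLoRA is more involved; the sketch below only illustrates the shape of a credit-based switch between merged and unmerged adapter execution, with an invented preference rule and switch cost.

```python
from collections import Counter

SWITCH_COST = 8   # illustrative cost (in batches) of merging/unmerging adapter weights

def choose_mode(queue_adapters, current_mode, credit, dominant_share=0.8):
    """Switch between 'merged' (one adapter fused into the base weights) and
    'unmerged' (adapters applied separately, allowing mixed batches).
    Credits accumulate while the other mode looks better; the switch cost is
    paid only once enough credit has built up. Purely illustrative."""
    counts = Counter(queue_adapters)
    _, top = counts.most_common(1)[0]
    preferred = "merged" if top / len(queue_adapters) >= dominant_share else "unmerged"
    if preferred == current_mode:
        return current_mode, 0
    credit += 1
    if credit >= SWITCH_COST:
        return preferred, 0        # pay the switch, reset credit
    return current_mode, credit

mode, credit = "unmerged", 0
for step, q in enumerate([["a"] * 9 + ["b"]] * 10):   # adapter 'a' dominates the queue
    mode, credit = choose_mode(q, mode, credit)
    print(step, mode, credit)
```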

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [Paper] [Code] (sketch below)

    • SJTU & MSRA

    • Semantic Variable: a unified abstraction to expose application-level knowledge to public LLM services.

      • Annotate an input/output variable in the prompt of a request.

      • Create the data pipeline when connecting multiple LLM requests.

      • Allow to perform conventional data flow analysis to uncover the correlation across multiple LLM requests.

    • Implemented in Python.
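
A toy illustration of the Semantic Variable abstraction: named placeholders mark which variables an LLM request reads and writes, so chained requests form an explicit dataflow graph before any text is generated. The classes and functions here are invented for illustration, not Parrot's API.

```python
class SemanticVariable:
    """A named placeholder whose value is produced by one LLM request and
    consumed by another, making the inter-request dataflow explicit."""
    def __init__(self, name, value=None):
        self.name, self.value = name, value

def llm_request(template, inputs, output):
    """Stand-in for submitting a prompt to the serving system. The system can see
    which variables a request reads and writes before any generation happens."""
    prompt = template.format(**{v.name: v.value for v in inputs})
    output.value = f"<llm output for: {prompt!r}>"   # fake generation
    return output

# Two chained requests: the second depends on the first's output variable,
# so the service can analyze and schedule them as a small DAG.
article = SemanticVariable("article", "CXL memory tiering in virtualized clouds")
summary = SemanticVariable("summary")
tweet   = SemanticVariable("tweet")

llm_request("Summarize: {article}", [article], summary)
llm_request("Write a tweet based on: {summary}", [summary], tweet)
print(tweet.value)
```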

  • Fairness in Serving Large Language Models [Paper] [Code] (sketch below)

    • UC Berkeley

    • This is the first work to discuss the fair serving of LLMs.

    • Propose a fair-serving algorithm called Virtual Token Counter (VTC).

      • Track the service received by each client.

      • Prioritize the clients that have received the least service.

      • Only manipulate the dispatch order; never reject a request that fits in the batch.
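
A compact sketch of the Virtual Token Counter policy as summarized above; the token weighting and data structures are simplified stand-ins.

```python
from collections import defaultdict

class VTCScheduler:
    """Fair dispatch by least accumulated service, counted in (weighted) tokens.
    Requests are only re-ordered, never rejected, as the notes above describe."""
    def __init__(self):
        self.counters = defaultdict(float)   # client -> virtual tokens served so far
        self.pending = []                    # queued (client, request) pairs

    def submit(self, client, request):
        self.pending.append((client, request))

    def next_request(self):
        if not self.pending:
            return None
        # Dispatch the request whose client has received the least service.
        i = min(range(len(self.pending)), key=lambda j: self.counters[self.pending[j][0]])
        return self.pending.pop(i)

    def record(self, client, prompt_tokens, output_tokens):
        # Illustrative weighting of input vs. output tokens (VTC's weights may differ).
        self.counters[client] += prompt_tokens + 2 * output_tokens

sched = VTCScheduler()
for i in range(3):
    sched.submit("heavy-user", f"h{i}")
sched.submit("light-user", "l0")
sched.record("heavy-user", prompt_tokens=1000, output_tokens=500)
print(sched.next_request())   # ('light-user', 'l0') jumps ahead of the heavy user's backlog
```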

Resource Allocation

  • Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences [Paper]

    • Meta Platforms

    • Main challenges for a resource-allocation framework.

      • Usability: how to translate real-life policies into precise mathematical formulas.

      • Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers at datacenter scale.

    • Rebalancer: Meta's resource-allocation framework.

      • An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).

      • A high-level specification language to lower the barrier for adoption by system practitioners (for usability).

Job Scheduling

  • When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling [Paper] [Code] (sketch below)

    • Tufts

    • PCS: Predictability-Centric Scheduling

    • Use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights).

    • Use a simulation-aided search strategy to discover WFQ configurations.
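
A small sketch of the idea: a weighted-fair-queueing (fluid) simulator parameterized by queue weights, plus a brute-force "simulation-aided" search that scores each candidate configuration by how predictable completion times are on a synthetic trace. The trace, the prediction formula, and the error metric are invented for illustration.

```python
import itertools

def simulate_wfq(jobs, weights, dt=0.1):
    """Fluid weighted-fair sharing of one unit of capacity across queues.
    jobs: list of (queue, arrival_time, service_demand). Returns job -> completion time."""
    remaining = {i: d for i, (_, _, d) in enumerate(jobs)}
    done, t = {}, 0.0
    while remaining:
        active = [i for i in remaining if jobs[i][1] <= t]
        qs = {jobs[i][0] for i in active}
        total_w = sum(weights[q] for q in qs) or 1.0
        jobs_in_q = {q: sum(1 for i in active if jobs[i][0] == q) for q in qs}
        for i in active:
            q = jobs[i][0]
            remaining[i] -= (weights[q] / total_w) / jobs_in_q[q] * dt
            if remaining[i] <= 0:
                done[i] = t + dt
                del remaining[i]
        t += dt
    return done

def prediction_error(jobs, weights):
    """Mean gap between a naive up-front completion-time estimate and the simulated
    outcome; a stand-in for a predictability objective."""
    done = simulate_wfq(jobs, weights)
    total_w = sum(weights.values())
    errs = [abs(done[i] - (arr + demand * total_w / weights[q]))
            for i, (q, arr, demand) in enumerate(jobs)]
    return sum(errs) / len(errs)

# Simulation-aided search: try candidate weight configurations on a small synthetic
# trace and keep the most predictable one.
jobs = [("short", 0.0, 1.0), ("short", 0.5, 1.0), ("long", 0.0, 5.0)]
candidates = [{"short": ws, "long": wl} for ws, wl in itertools.product([1, 2, 4], repeat=2)]
best = min(candidates, key=lambda w: prediction_error(jobs, w))
print("most predictable weights:", best, round(prediction_error(jobs, best), 2))
```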

  • MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale [Paper]

    • Meta Platforms

    • MAST: ML Application Scheduler on Twine

    • Provide a global-scheduling abstraction to all ML training workloads.

    • Three design principles: temporal decoupling, scope decoupling, and exhaustive search.

Auto Parallelization

  • nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training [Paper] [Code] (sketch below)

    • USTC & MSRA & xAI & BaseBit Technologies

    • Empower domain experts to construct their own search space through three primitives, op-trans, op-assign, and op-order.

    • Allow the application of constraints to those primitives during space construction.
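
A heavily simplified illustration of how the three named primitives could compose a constrained search space: `op_trans` proposes operator partitionings, `op_assign` places partitions on devices, `op_order` fixes a schedule, and a user constraint prunes combinations during construction. All data structures below are invented; they are not nnScaler's API.

```python
import itertools

# A toy "operator": a matmul with dimension names we might partition on.
op = {"name": "matmul", "dims": ["batch", "hidden"]}
devices = ["gpu0", "gpu1"]

def op_trans(op):
    """Ways to transform/partition the operator (illustrative)."""
    return [("replicate", None)] + [("split", d) for d in op["dims"]]

def op_assign(partition, devices):
    """Ways to place the resulting partitions onto devices."""
    return [tuple(p) for p in itertools.permutations(devices)]

def op_order(assignment):
    """Execution orders for the placed partitions (trivial here: as given or reversed)."""
    return [assignment, tuple(reversed(assignment))]

def search_space(op, devices, constraint=lambda t, a, o: True):
    """Cross-product of the three primitives, pruned by user constraints
    during space construction (as the entry above describes)."""
    for t in op_trans(op):
        for a in op_assign(t, devices):
            for o in op_order(a):
                if constraint(t, a, o):
                    yield t, a, o

# Domain-expert constraint: only consider plans that split along the batch dimension.
plans = list(search_space(op, devices, lambda t, a, o: t == ("split", "batch")))
print(len(plans), plans[0])
```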

Machine Learning Inference

  • Usher: Holistic Interference Avoidance for Resource Optimized ML Inference [Paper] [Code] (sketch below)

    • UVA & GaTech

    • Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).

      • GPU kernel-based model resource requirement estimator.

      • A heuristic, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.

      • Operator graph merger to merge multiple models to minimize interference in GPU cache.
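
A toy rendition of the scheduling heuristic: per-model resource profiles (standing in for the kernel-level requirement estimator) drive a greedy placement that packs replicas onto GPUs while refusing placements that would oversubscribe compute or memory. The profiles, capacities, and scoring rule are invented for illustration.

```python
# Invented per-model profiles: (SM fraction, memory fraction) per batch size.
profiles = {
    "resnet": {1: (0.15, 0.10), 8: (0.45, 0.20)},
    "bert":   {1: (0.25, 0.20), 8: (0.70, 0.40)},
}

gpus = [{"sm": 0.0, "mem": 0.0} for _ in range(2)]

def place(model, batch, replicas):
    """Greedy, interference-aware placement: put each replica on the GPU with the
    most free compute, and refuse placements that would oversubscribe it."""
    sm, mem = profiles[model][batch]
    placed = []
    for _ in range(replicas):
        g = min(range(len(gpus)), key=lambda i: gpus[i]["sm"])
        if gpus[g]["sm"] + sm > 1.0 or gpus[g]["mem"] + mem > 1.0:
            break                      # would interfere / not fit; stop replicating
        gpus[g]["sm"] += sm
        gpus[g]["mem"] += mem
        placed.append(g)
    return placed

print("resnet@8 ->", place("resnet", 8, replicas=2))
print("bert@1   ->", place("bert", 1, replicas=2))
print("GPU usage:", gpus)
```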

Tensor Program Generation

  • Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning [Paper] [Code]

    • USTC & Huawei & ByteDance & Hunan University

    • Tensor Language Model (TLM)

  • Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [Paper] [Code]

    • MSRA

  • MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures [Paper] [Code]

    • Sydney & Alibaba

    • The code is currently not available.

Machine Learning APIs

  • ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications [Paper] [Code]

    • UChicago & ECNU & MSR

In-Network Machine Learning

  • Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents [Paper] [Code]

    • Stanford & Princeton & Sapienza University of Rome & UMich

Microkernel

  • Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel [Paper]

    • Huawei Central Software Institute & SJTU

    • HongMeng kernel (HM)

Compute Express Link (CXL)

  • Managing Memory Tiers with CXL in Virtualized Environments [Paper]

    • Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich

Distributed Snapshots

  • Beaver: Practical Partial Snapshots for Distributed Cloud Services [Paper] [Code]

    • UPenn & SJTU & Princeton & Microsoft & UW

Network Interface Card (NIC)

  • High-throughput and Flexible Host Networking for Accelerated Computing [Paper] [Code]

    • Stanford & Cornell & Enfabrica

Collective Communication Library

  • ACCL+: an FPGA-Based Collective Engine for Distributed Applications [Paper]

    • ETH & Amsterdam & AMD

Hardware Accelerators

  • Performance Interfaces for Hardware Accelerators [Paper] [Code]

    • EPFL

    • LPN: Latency Petri Net

Cloud Block Storage

  • Burstable Cloud Block Storage with Data Processing Units [Paper]

    • PKU & Alibaba Cloud

Formal Verification

  • Anvil: Verifying Liveness of Cluster Management Controllers [Paper] [Code]

    • UIUC & UW-Madison & VMware Research & Feldera

    • Best Paper Award

References

  • Notes from SJTU IPADS (in Chinese)

    • OSDI 2024 Paper Review, Day 1 Session 1: Memory Management (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 1 Session 2: Low-Latency LLM Serving (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 1 Session 3: Distributed Systems (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 2 Session 4: Deep Learning (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 2 Session 5: Operating Systems (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 2 Session 6: Cloud Computing (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 2 Session 7: Formal Verification (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 3 Session 8: Cloud Security (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 3 Session 9: Data Management (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 3 Session 10: Analysis of Correctness (by IPADS-SYS, on Zhihu)

    • OSDI 2024 Paper Review, Day 3 Session 11: ML Scheduling (by IPADS-SYS, on Zhihu)