OSDI 2024

Meta Info

Homepage: https://www.usenix.org/conference/osdi24

Paper list: https://www.usenix.org/conference/osdi24/technical-sessions

Acceptance Rate

19.2% (= 53 / 276)

Papers

Large Language Models (LLMs)

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [Paper] [Code]

    • MSR India & GaTech

    • Sarathi-Serve

    • Chunked prefills: split a prefill request into near equal-sized chunks so a long prompt can be processed over multiple iterations and batched together with ongoing decodes.

    • Stall-free scheduling: admit new requests into a running batch without pausing ongoing decodes, improving throughput at large batch sizes while minimizing the latency impact of batching (sketched below).
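A minimal sketch of the chunked-prefill / stall-free batching idea above, assuming a fixed per-iteration token budget. The names (`Request`, `TOKEN_BUDGET`, `schedule_iteration`) and the budget value are illustrative assumptions, not Sarathi-Serve's actual implementation:

```python
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # assumed max tokens processed per model iteration


@dataclass
class Request:
    rid: int
    prompt_tokens: int       # total prefill (prompt) tokens
    prefill_done: int = 0    # prefill tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefill_done >= self.prompt_tokens


def schedule_iteration(running: list, waiting: deque) -> list:
    """Build one batch as (request, tokens_this_iteration) pairs."""
    batch, budget = [], TOKEN_BUDGET

    # 1. Admit ongoing decodes first (one token each) so they never stall.
    for r in running:
        if r.in_decode and budget > 0:
            batch.append((r, 1))
            budget -= 1

    # 2. Fill the leftover budget with chunks of pending prefills.
    for r in list(running) + list(waiting):
        if budget == 0:
            break
        if not r.in_decode:
            chunk = min(budget, r.prompt_tokens - r.prefill_done)
            batch.append((r, chunk))
            r.prefill_done += chunk
            budget -= chunk
            if r in waiting:
                waiting.remove(r)
                running.append(r)
    return batch


running = [Request(rid=0, prompt_tokens=4000, prefill_done=4000)]  # already decoding
waiting = deque([Request(rid=1, prompt_tokens=6000)])              # new long prompt
print(schedule_iteration(running, waiting))  # decode gets 1 token, prefill a 511-token chunk
```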

  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [Paper] [Code]

    • Edinburgh

    • Multi-tier checkpoint loading.

    • Live migration of LLM inference: the source server migrates only the tokens; a re-computation of the KV-cache is triggered at the destination server.

    • Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
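A hedged sketch of the cost-model comparison described above. The field names, bandwidth/throughput numbers, and the assumption that checkpoint loading overlaps with migrating away an ongoing inference are mine, not ServerlessLLM's actual estimator:

```python
def estimated_startup(server: dict, ckpt_gb: float) -> float:
    """Startup latency if the model is placed on `server`: load the checkpoint
    from the fastest local tier holding it, overlapped with migrating away
    (tokens only) any inference currently occupying the GPU, whose KV cache is
    then re-computed at its destination."""
    load_s = ckpt_gb / server["best_tier_bw_gbps"]                   # multi-tier loading
    migrate_s = server["running_tokens"] / server["prefill_tok_per_s"]
    return max(load_s, migrate_s)                                    # assume they overlap


def pick_server(servers: list, ckpt_gb: float) -> dict:
    """Choose the server with the lowest estimated model-startup latency."""
    return min(servers, key=lambda s: estimated_startup(s, ckpt_gb))


servers = [
    # Checkpoint cached in a fast local tier, but the GPU is busy with another inference.
    {"name": "A", "best_tier_bw_gbps": 20.0, "running_tokens": 8000, "prefill_tok_per_s": 4000.0},
    # Idle GPU, but the checkpoint must come from slow remote storage.
    {"name": "B", "best_tier_bw_gbps": 1.0, "running_tokens": 0, "prefill_tok_per_s": 4000.0},
]
print(pick_server(servers, ckpt_gb=14.0)["name"])  # "A": ~2 s vs ~14 s startup
```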

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Paper]

    • Seoul National University

    • InfiniGen: a KV cache management framework for long-text generation.

    • Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.

    • Prefetch only the essential KV cache entries instead of fetching them all, which mitigates the overhead of fetching from host memory (see the sketch below).
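A rough sketch of the speculation step as I read it: approximate the next layer's attention scores using a slice of its query weight and a partial key cache kept on the GPU, then prefetch only the top-scoring tokens' KV entries. The shapes, function name, and top-k policy are illustrative assumptions:

```python
import numpy as np


def speculate_important_tokens(x, wq_partial, k_cache_partial, top_k):
    """
    x:               input to the current layer, shape (d_model,)
    wq_partial:      slice of the next layer's query weight, shape (d_model, d_part)
    k_cache_partial: partial key cache of the next layer, shape (seq_len, d_part)
    Returns the indices of the top_k tokens whose full KV entries to prefetch.
    """
    q_approx = x @ wq_partial            # cheap "rehearsal" of the next layer's query
    scores = k_cache_partial @ q_approx  # approximate attention logits per cached token
    return np.argsort(scores)[-top_k:]   # highest-scoring (most important) tokens


# Toy shapes: 1024 cached tokens, 4096-dim model, only 64 "partial" dims kept on GPU.
rng = np.random.default_rng(0)
prefetch_idx = speculate_important_tokens(
    rng.normal(size=4096),
    rng.normal(size=(4096, 64)),
    rng.normal(size=(1024, 64)),
    top_k=32,
)
# Only these 32 KV entries are fetched from host memory for the next layer's attention.
```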

  • Llumnix: Dynamic Scheduling for Large Language Model Serving [Paper] [Code]

    • Alibaba

    • Reschedule requests to improve load-balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.

    • Live migration for requests and the in-memory states (tokens).
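An illustrative sketch of the migration-based load-balancing decision. The load metric (free KV-cache blocks), the imbalance threshold, and the victim choice are placeholder assumptions, not Llumnix's actual policy:

```python
def rebalance(instances: list, imbalance_threshold: float = 0.3):
    """instances: [{'name': str, 'free_blocks': int, 'requests': [...]}, ...]"""
    src = min(instances, key=lambda i: i["free_blocks"])  # most loaded instance
    dst = max(instances, key=lambda i: i["free_blocks"])  # least loaded instance
    total = src["free_blocks"] + dst["free_blocks"]
    if total == 0 or not src["requests"]:
        return None
    if (dst["free_blocks"] - src["free_blocks"]) / total < imbalance_threshold:
        return None                       # load is balanced enough; do nothing
    req = src["requests"].pop()           # pick a victim request
    dst["requests"].append(req)           # live-migrate its tokens / in-memory state
    return req, src["name"], dst["name"]
```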

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Paper] [Code]

    • PKU & UCSD

    • Disaggregate the prefill and decoding computation.

    • Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
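A toy sketch of searching for a prefill/decode GPU split and a parallelism degree that maximize goodput per GPU. The throughput model inside `goodput` is a made-up placeholder, not DistServe's simulator or analytical model:

```python
def goodput(prefill_gpus: int, decode_gpus: int, tp: int) -> float:
    # Placeholder model: requests/s that meet the TTFT SLO (prefill) and the
    # per-token latency SLO (decode); the pipeline is limited by the slower phase.
    prefill_rate = 1000.0 * prefill_gpus / tp ** 0.1
    decode_rate = 600.0 * decode_gpus
    return min(prefill_rate, decode_rate)


def best_split(total_gpus: int = 8, tp_choices=(1, 2, 4)):
    best = None
    for p in range(1, total_gpus):            # GPUs dedicated to prefill
        d = total_gpus - p                    # GPUs dedicated to decode
        for tp in tp_choices:                 # per-phase parallelism degree
            per_gpu = goodput(p, d, tp) / total_gpus
            if best is None or per_gpu > best[0]:
                best = (per_gpu, p, d, tp)
    return best  # (goodput per GPU, prefill GPUs, decode GPUs, tensor-parallel degree)


print(best_split())
```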

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving [Paper]

    • PKU & Shanghai AI Lab

    • A credit-based batching algorithm to decide when to merge and unmerge LoRA adapters with the base model.

    • A request-adapter co-migration algorithm to decide when to migrate between different worker replicas.
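A rough sketch of a credit-based switch between merged execution (one LoRA adapter folded into the base weights, efficient for a homogeneous burst) and unmerged execution (base model batched across adapters). The credit accounting and the `MergePolicy` class are my simplification, not dLoRA's exact algorithm:

```python
from collections import Counter


class MergePolicy:
    """Decide between 'merged' execution (one adapter folded into the base
    weights) and 'unmerged' execution (base model batched across adapters)."""

    def __init__(self, switch_cost_credits: int = 8):
        self.credits = 0
        self.switch_cost = switch_cost_credits   # amortizes the (un)merge overhead
        self.merged_adapter = None               # adapter currently merged, if any

    def step(self, batch_adapters: list) -> str:
        counts = Counter(batch_adapters)
        if self.merged_adapter is None:
            favored, fav_count = counts.most_common(1)[0]
        else:
            favored, fav_count = self.merged_adapter, counts[self.merged_adapter]
        # Earn credits for requests served well by merged mode, spend for the rest.
        self.credits += fav_count - (len(batch_adapters) - fav_count)
        if self.merged_adapter is None and self.credits >= self.switch_cost:
            self.merged_adapter, self.credits = favored, 0
        elif self.merged_adapter is not None and self.credits <= -self.switch_cost:
            self.merged_adapter, self.credits = None, 0
        return f"merged:{self.merged_adapter}" if self.merged_adapter else "unmerged"


policy = MergePolicy()
print(policy.step(["adapter_a"] * 16))                      # homogeneous burst -> merge
print(policy.step(["adapter_a"] * 12 + ["adapter_b"] * 4))  # still mostly adapter_a
```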

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [Paper] [Code]

    • SJTU & MSRA

    • Semantic Variable: a unified abstraction to expose application-level knowledge to public LLM services.

      • Annotate an input/output variable in the prompt of a request.

      • Create the data pipeline when connecting multiple LLM requests.

      • Allow conventional data-flow analysis to uncover correlations across multiple LLM requests.

    • Implemented in Python.
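A small sketch of the Semantic Variable idea: declaring request outputs as variables that feed later requests exposes the whole dataflow DAG to the serving system. The `SemanticVariable`/`LLMRequest` classes are an invented pseudo-API for illustration, not Parrot's actual Python interface:

```python
class SemanticVariable:
    def __init__(self, name, producer=None):
        self.name, self.producer = name, producer     # producer = upstream request


class LLMRequest:
    def __init__(self, template, **inputs):
        self.template, self.inputs = template, inputs
        self.output = SemanticVariable(f"out_of_{id(self)}", producer=self)

    def deps(self):
        return [v.producer for v in self.inputs.values()
                if isinstance(v, SemanticVariable) and v.producer]


# Two chained requests: the service can infer that `summarize` must run before
# `translate`, pipeline them, and avoid shipping the intermediate text back to
# the client.
article = SemanticVariable("article")
summarize = LLMRequest("Summarize: {article}", article=article)
translate = LLMRequest("Translate to French: {summary}", summary=summarize.output)
assert translate.deps() == [summarize]
```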

  • Fairness in Serving Large Language Models [Paper] [Code]

    • UC Berkeley

    • This is the first work to discuss the fair serving of LLMs.

    • Propose a fair-serving algorithm called Virtual Token Counter (VTC).

      • Track the service received by each client.

      • Prioritize the clients that have received the least service.

      • Only manipulate the dispatch order; never reject a request that can fit in the batch (see the sketch below).
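A minimal sketch of the Virtual Token Counter as summarized above (simplified; the real algorithm includes refinements such as handling newly backlogged clients, which are omitted here):

```python
from collections import defaultdict, deque


class VTCScheduler:
    def __init__(self):
        self.counters = defaultdict(int)   # client -> tokens of service received
        self.queues = defaultdict(deque)   # client -> pending requests

    def submit(self, client: str, request: dict):
        self.queues[client].append(request)

    def dispatch(self, batch_budget: int) -> list:
        """Fill one batch, always serving the least-served backlogged client first."""
        batch = []
        while batch_budget > 0:
            backlogged = [c for c, q in self.queues.items() if q]
            if not backlogged:
                break
            client = min(backlogged, key=lambda c: self.counters[c])
            req = self.queues[client][0]
            if req["tokens"] > batch_budget:
                break                       # doesn't fit right now; wait, don't reject
            self.queues[client].popleft()
            batch.append((client, req))
            self.counters[client] += req["tokens"]  # charge the service received
            batch_budget -= req["tokens"]
        return batch


sched = VTCScheduler()
sched.submit("heavy", {"tokens": 400})
sched.submit("heavy", {"tokens": 400})
sched.submit("light", {"tokens": 50})
print(sched.dispatch(batch_budget=512))  # "light" gets served despite "heavy"'s backlog
```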

Resource Allocation

  • Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences [Paper]

    • Meta Platforms

    • Main challenges for a resource-allocation framework.

      • Usability: how to translate real-life policies into precise mathematical formulas.

      • Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers.

    • Rebalancer: Meta's resource-allocation framework.

      • An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).

      • A high-level specification language to lower the barrier for adoption by system practitioners (for usability).

Job Scheduling

  • When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling [Paper] [Code]

    • Tufts

    • PCS: Predictability-Centric Scheduling

    • Use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights).

    • Use a simulation-aided search strategy to discover WFQ configurations.
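A toy illustration of the approach: an idealized weighted-fair (GPS) simulator evaluates candidate WFQ weight configurations, and a search keeps the best-scoring one. The scoring here is mean completion time only; PCS's actual search also accounts for how tight and reliable the completion-time estimates are, which this sketch ignores:

```python
def simulate_gps(weights: dict, jobs: list, capacity: float = 1.0, dt: float = 1.0):
    """Idealized weighted fair sharing (GPS), which WFQ approximates.
    weights: {class: weight}; jobs: [(cls, size), ...]; returns completion times."""
    remaining = [[cls, float(size)] for cls, size in jobs]
    t, completion_times = 0.0, []
    while remaining:
        total_w = sum(weights[cls] for cls, _ in remaining)
        for job in remaining:
            job[1] -= capacity * dt * weights[job[0]] / total_w  # weighted share
        t += dt
        completion_times += [t for _, size in remaining if size <= 0]
        remaining = [job for job in remaining if job[1] > 0]
    return completion_times


def search_weights(jobs: list, candidate_configs: list) -> dict:
    # Toy score: mean completion time. A predictability-centric search would also
    # score how far actual completions may deviate from the estimates given to users.
    return min(candidate_configs,
               key=lambda w: sum(simulate_gps(w, jobs)) / len(jobs))


jobs = [("short", 2), ("short", 3), ("long", 40)]
print(search_weights(jobs, [{"short": 1, "long": 1}, {"short": 4, "long": 1}]))
```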

  • MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale [Paper]

    • Meta Platforms

    • MAST: ML Application Scheduler on Twine

    • Provide a global-scheduling abstraction to all ML training workloads.

    • Three design principles: temporal decoupling, scope decoupling, and exhaustive search.

Auto Parallelization

  • nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training [Paper] [Code]

    • USTC & MSRA & xAI & BaseBit Technologies

    • Empower domain experts to construct their own search space through three primitives: op-trans, op-assign, and op-order.

    • Allow constraints to be applied to those primitives during space construction (see the sketch below).
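A very rough, hypothetical sketch of constraint-guided space construction in the spirit of the three primitives; the function names and data layout are mine, not nnScaler's API:

```python
from itertools import product


def build_space(ops, transforms, devices, constraints):
    """ops: operator names; transforms: {op: [partition choices]} (op-trans);
    devices: device ids (op-assign); constraints: [fn(op, trans, dev) -> bool]."""
    space = []
    for op in ops:
        choices = [
            (trans, dev)
            for trans, dev in product(transforms[op], devices)
            if all(check(op, trans, dev) for check in constraints)  # constraint-guided pruning
        ]
        space.append((op, choices))
    return space  # execution order (op-order) could be constrained and searched similarly


# Example constraint: never tensor-split attention operators.
def no_tp_for_attn(op, trans, dev):
    return not (op == "attn" and trans == "tensor-split")


space = build_space(
    ops=["attn", "mlp"],
    transforms={"attn": ["replicate", "tensor-split"], "mlp": ["replicate", "tensor-split"]},
    devices=[0, 1],
    constraints=[no_tp_for_attn],
)
print(space)
```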

Machine Learning Inference

  • Usher: Holistic Interference Avoidance for Resource Optimized ML Inference [Paper] [Code]

    • UVA & GaTech

    • Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).

      • GPU kernel-based model resource requirement estimator.

      • A heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement (see the placement sketch after this list).

      • An operator graph merger that merges multiple models' operator graphs to minimize interference in the GPU cache.
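An illustrative greedy sketch of interference-aware placement: pack models onto GPUs using per-model compute and memory-bandwidth estimates (as a kernel-profiling-based estimator might produce) and reject co-locations whose summed demand exceeds capacity. The capacities, demand numbers, and best-fit heuristic are assumptions, not Usher's actual scheduler:

```python
def place(models: list, num_gpus: int, cap_compute: float = 1.0, cap_membw: float = 1.0):
    """models: [{'name': str, 'compute': float, 'membw': float}, ...], with
    demands normalized to a single GPU's capacity."""
    gpus = [{"compute": 0.0, "membw": 0.0, "models": []} for _ in range(num_gpus)]
    # Pack the most demanding models first (greedy bin-packing heuristic).
    for m in sorted(models, key=lambda m: m["compute"] + m["membw"], reverse=True):
        fits = [g for g in gpus
                if g["compute"] + m["compute"] <= cap_compute
                and g["membw"] + m["membw"] <= cap_membw]
        if not fits:
            raise RuntimeError(f"no interference-free placement for {m['name']}")
        g = max(fits, key=lambda g: g["compute"] + g["membw"])  # best fit: pack tightly
        g["models"].append(m["name"])
        g["compute"] += m["compute"]
        g["membw"] += m["membw"]
    return gpus


models = [{"name": "resnet", "compute": 0.5, "membw": 0.2},
          {"name": "bert", "compute": 0.4, "membw": 0.5},
          {"name": "gpt2", "compute": 0.3, "membw": 0.3}]
print(place(models, num_gpus=2))
```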

Tensor Program Generation

  • Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning [Paper] [Code]

    • USTC & Huawei & ByteDance & Hunan University

    • Tensor Language Model (TLM)

  • Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [Paper] [Code]

    • MSRA

  • MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures [Paper] [Code]

    • Sydney & Alibaba

    • The code is currently not available.

Machine Learning APIs

In-Network Machine Learning

Microkernel

  • Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel [Paper]

    • Huawei Central Software Institute & SJTU

    • HongMeng kernel (HM)

  • Managing Memory Tiers with CXL in Virtualized Environments [Paper]

    • Columbia & Microsoft Azure & UW & Carl Waldspurger Consulting & Intel & UW-Madison & UMich

Distributed Snapshots

Network Interface Card (NIC)

Collective Communication Library

  • ACCL+: an FPGA-Based Collective Engine for Distributed Applications [Paper]

    • ETH & Amsterdam & AMD

Hardware Accelerators

Cloud Block Storage

  • Burstable Cloud Block Storage with Data Processing Units [Paper]

    • PKU & Alibaba Cloud

Formal Verification
