SOSP 2024

Meta Info

Homepage: https://sigops.org/s/conferences/sosp/2024/

Papers

Large Language Models (LLMs)

  • LLM Training

    • Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

      • PKU

    • Perseus: Removing Energy Bloat from Large Model Training [arXiv]

      • UMich

      • Use a graph cut-based algorithm to obtain the iteration time vs. energy Pareto frontier; schedule energy consumption across time accordingly (toy sketch below).
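
A minimal sketch of the Pareto-frontier step, assuming a hypothetical per-stage (time, energy) profile at three GPU frequencies. It enumerates plans by brute force rather than using Perseus's graph cut-based algorithm, and the cost model (iteration time = slowest stage, energy = sum over stages) is a simplification:

```python
from itertools import product

# Hypothetical profile: stage -> (stage time in s, stage energy in J)
# at three decreasing GPU frequencies.
profile = {
    0: [(1.00, 300.0), (1.15, 240.0), (1.40, 210.0)],
    1: [(1.30, 380.0), (1.50, 300.0), (1.80, 270.0)],
    2: [(0.90, 260.0), (1.05, 210.0), (1.25, 190.0)],
}

def iteration_cost(plan):
    """Toy model: steady-state pipeline throughput is bounded by the
    slowest stage; energy adds up across stages."""
    times, energies = zip(*plan)
    return max(times), sum(energies)

points = [iteration_cost(plan) + (plan,) for plan in product(*profile.values())]

# Keep only Pareto-optimal points: no other plan is both faster and cheaper.
frontier = [
    (t, e, p) for (t, e, p) in points
    if not any(t2 <= t and e2 <= e and (t2, e2) != (t, e)
               for (t2, e2, _) in points)
]

for t, e, plan in sorted(frontier):
    print(f"iteration time {t:.2f}s, energy {e:.0f}J, per-stage plan {plan}")
```

On such a frontier, plans that slow only stages off the critical path cut energy at little or no cost in iteration time; that is the "energy bloat" the paper targets.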

  • LLM Inference

    • LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [arXiv]

      • PKU

      • ESP: Elastic Sequence Parallelism

      • Elastically adjust the degree of parallelism in real time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances (toy policy sketch below).
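
A toy policy sketch for picking a degree of parallelism (DoP) per phase. The prefill/decode split follows the paper's setting, but the function, the 8K-tokens-per-instance heuristic, and the numbers are illustrative assumptions, not LoongServe's actual scheduler:

```python
def pick_dop(phase: str, prompt_tokens: int, free_instances: int) -> int:
    """Prefill cost grows with prompt length, so long prompts get more
    instances; decode emits one token per step, so a small group works."""
    if phase == "prefill":
        want = -(-prompt_tokens // 8192)   # ceil: assumed 8K tokens/instance
    else:
        want = 1
    return max(1, min(want, free_instances))

for prompt in (2_048, 32_768, 200_000):
    pre = pick_dop("prefill", prompt, free_instances=8)
    dec = pick_dop("decode", prompt, free_instances=8)
    print(f"{prompt:>7} prompt tokens -> prefill DoP {pre}, decode DoP {dec}")
```

Shrinking the group after prefill is where the listed techniques matter: the key-value cache should stay in place rather than migrate, and the remaining decode communication overlaps with computation.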

    • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [arXiv]

      • SJTU IPADS

ML Serving

  • Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation

    • GaTech & Princeton

  • Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [arXiv]

    • Princeton & GaTech

    • Automatically apply and manage early exits in ML models, letting certain inputs return results at intermediate layers (minimal sketch below).
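
A minimal early-exit sketch in PyTorch; the toy model and the batch-wide confidence threshold are illustrative assumptions (the paper's contribution is applying and managing such exits automatically, while this snippet only shows the mechanism):

```python
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, dim=64, classes=10, threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.exit1 = nn.Linear(dim, classes)   # intermediate exit ramp
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, classes)    # final classifier
        self.threshold = threshold

    def forward(self, x):
        h = self.block1(x)
        probs = torch.softmax(self.exit1(h), dim=-1)
        conf, pred = probs.max(dim=-1)
        if bool((conf >= self.threshold).all()):  # confident: skip the rest
            return pred, "exit1"
        return self.head(self.block2(h)).argmax(dim=-1), "final"

pred, where = EarlyExitMLP()(torch.randn(4, 64))
print(pred, where)   # untrained ramps are rarely confident -> "final"
```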

Distributed Training

  • SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures [arXiv]

    • Stanford

    • Dynamically re-route the work of a failed server to its data-parallel peers and execute it within bubbles of the original pipeline schedule (toy sketch below).
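
A toy sketch of the re-routing decision, assuming round-robin assignment across surviving peers; placing the extra microbatches into specific bubble slots of the peers' schedules (the actual scheduling problem) is not modeled:

```python
from collections import defaultdict

replicas = {0: ["r0", "r1", "r2"]}        # stage -> data-parallel peers
microbatches = {r: [f"{r}.mb{j}" for j in range(4)] for r in replicas[0]}

def reroute(stage: int, failed: str):
    """Spread the failed replica's microbatches over surviving peers."""
    survivors = [r for r in replicas[stage] if r != failed]
    extra = defaultdict(list)
    for k, mb in enumerate(microbatches[failed]):
        extra[survivors[k % len(survivors)]].append(mb)
    return extra

for peer, work in reroute(stage=0, failed="r1").items():
    print(f"{peer} absorbs {work} in its pipeline bubbles")
```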

  • Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections [arXiv]

    • ICL

    • Tenplex: a state management library.

      • Enable jobs to change their parallelism dynamically.

      • PTC: Parallelizable Tensor Collection

        • Dataset state

        • Model state

      • Execute PTC transformations in parallel with minimal data movement between workers (re-partitioning sketch below).
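
A sketch of the re-partitioning idea, assuming even contiguous sharding of a flat tensor; the names and layout are illustrative, not Tenplex's API. Going from 2-way to 4-way parallelism, each new worker only needs the ranges it does not already hold:

```python
def partition(total: int, ways: int):
    """Evenly split [0, total) into contiguous per-worker ranges."""
    step = total // ways
    return [(w * step, total if w == ways - 1 else (w + 1) * step)
            for w in range(ways)]

def replan(total: int, old_ways: int, new_ways: int):
    """Map each new partition onto the old workers that hold its data."""
    old = partition(total, old_ways)
    for new_worker, (lo, hi) in enumerate(partition(total, new_ways)):
        for old_worker, (olo, ohi) in enumerate(old):
            s, e = max(lo, olo), min(hi, ohi)
            if s < e:
                print(f"new worker {new_worker} needs [{s}, {e}) "
                      f"from old worker {old_worker}")

replan(total=1024, old_ways=2, new_ways=4)
```

A new worker whose range already sits on its co-located old worker moves nothing, which is the sense in which transformations can run with minimal data movement.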

ML Compilation

  • Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor [arXiv]

    • UIUC & MSRA

    • T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory of inter-core connected AI chips (e.g., the Graphcore IPU); toy sketch below.
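
A toy NumPy stand-in for the execution pattern this enables, with shard layout and ring order as illustrative assumptions (not T10's generated schedule): each core keeps one operand's shard resident in local memory while the other operand's shards logically rotate core-to-core, so all traffic is inter-core shifts instead of shared-memory loads:

```python
import numpy as np

cores, n = 4, 8
shard = n // cores
A, B = np.random.rand(n, n), np.random.rand(n, n)
A_shards = np.split(A, cores, axis=0)   # shards that rotate between cores
B_shards = np.split(B, cores, axis=1)   # shards resident in local memory
C = np.zeros((n, n))

for step in range(cores):
    for core in range(cores):
        i = (core + step) % cores       # A shard this core holds at this step
        C[i * shard:(i + 1) * shard, core * shard:(core + 1) * shard] = \
            A_shards[i] @ B_shards[core]
    # on real hardware, each core would now shift its A shard to a neighbor

assert np.allclose(C, A @ B)
print("shift-based distributed result matches A @ B")
```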

  • SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference

    • IISc

Serverless Computing

  • Dirigent: Lightweight Serverless Orchestration [arXiv]

    • ETH

    • Simplify cluster state management compared with existing orchestrators such as Kubernetes; eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads (state-reconstruction sketch below).
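
A minimal illustration of the state-management design point, as my own sketch rather than Dirigent's code: keep per-sandbox state in control-plane memory and rebuild it from node reports after a restart, instead of persisting every transition to an external store:

```python
class ControlPlane:
    def __init__(self):
        self.sandboxes = {}                  # sandbox id -> state, memory only

    def on_event(self, sandbox_id: str, state: str):
        self.sandboxes[sandbox_id] = state   # no durable write per transition

    def recover(self, node_reports):
        """Rebuild state from what nodes report after a control-plane
        restart, instead of replaying a persistent log."""
        self.sandboxes = {sid: st
                          for report in node_reports
                          for sid, st in report.items()}

cp = ControlPlane()
cp.on_event("fn-a/1", "running")
cp = ControlPlane()                          # simulated crash + restart
cp.recover([{"fn-a/1": "running"}, {"fn-b/7": "idle"}])
print(cp.sandboxes)
```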
