SOSP 2024

Meta Info

Homepage: https://sigops.org/s/conferences/sosp/2024/

Paper List

Papers

Large Language Models (LLMs)

  • LLM Training

    • ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation [Paper] [Slides] [arXiv]

      • Stanford

      • Dynamically re-route the work of a failed server to data-parallel peers; execute within bubbles of the original pipeline schedule.

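      A minimal sketch of the rerouting idea, assuming a toy schedule representation (per-worker lists of (timeslot, microbatch) pairs, with None marking a pipeline bubble); the paper's actual scheduler is considerably richer:

      ```python
      # Toy model of ReCycle-style rerouting (structure and names are my own
      # illustration, not the paper's actual schedule representation).

      def reroute(schedule, failed, peers):
          """Move a failed worker's microbatches into its data-parallel
          peers' pipeline bubbles (slots holding None)."""
          orphaned = [mb for _, mb in schedule.pop(failed) if mb is not None]
          for peer in peers:
              slots = schedule[peer]
              for i, (t, mb) in enumerate(slots):
                  if mb is None and orphaned:      # fill an idle bubble
                      slots[i] = (t, orphaned.pop(0))
          return schedule, orphaned                # leftovers would extend the step

      # Two data-parallel replicas of pipeline stage 0; "s0_dp0" fails.
      sched = {
          "s0_dp0": [(0, "mb0"), (1, "mb2"), (2, None)],
          "s0_dp1": [(0, "mb1"), (1, "mb3"), (2, None), (3, None)],
      }
      print(reroute(sched, failed="s0_dp0", peers=["s0_dp1"]))
      ```
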
    • Enabling Parallelism Hot Switching for Efficient Training of Large Language Models [Paper] [Code]

      • PKU

      • HotSPa — a system that adopts multiple parallelism strategies for efficient training over variable-length sequence inputs

        • Classify a mini-batch of training samples into several groups and train each group with the most suitable parallelism strategy.

        • The graph compiler generates multiple executable computation graphs that share the same backbone storage of model states.

        • Deduces efficient many-to-many communication plans for parallelism hot switching.

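      A hedged sketch of the grouping step, with made-up length thresholds and strategy tuples; `hot_switch` and `run_group` are hypothetical placeholders for the graph switch and per-group training:

      ```python
      # (max_seq_len, strategy) pairs: shorter sequences favor data
      # parallelism, longer ones shift toward tensor/pipeline parallelism.
      # All numbers below are invented for illustration.
      STRATEGIES = [
          (2_048,   {"dp": 8, "tp": 1, "pp": 1}),
          (16_384,  {"dp": 2, "tp": 4, "pp": 1}),
          (131_072, {"dp": 1, "tp": 4, "pp": 2}),
      ]

      def group_minibatch(sample_lengths):
          """Assign each sample to the first strategy whose cap fits it."""
          groups = {gid: [] for gid in range(len(STRATEGIES))}
          for idx, length in enumerate(sample_lengths):
              for gid, (cap, _) in enumerate(STRATEGIES):
                  if length <= cap:
                      groups[gid].append(idx)
                      break
          return groups

      def hot_switch(strategy):
          # Placeholder: the real system re-shards model states via planned
          # many-to-many communication before running under `strategy`.
          print("switch to", strategy)

      def run_group(members):
          print("train on samples", members)

      def train_step(sample_lengths):
          for gid, members in group_minibatch(sample_lengths).items():
              if members:
                  hot_switch(STRATEGIES[gid][1])
                  run_group(members)

      train_step([512, 900, 40_000, 3_000])
      ```
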
    • Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections [Paper] [Code] [arXiv]

      • ICL & Aalto University & Edinburgh

      • Tenplex — a state management library.

        • Enable jobs to change the parallelism dynamically.

        • PTC: Parallelizable Tensor Collection

          • Dataset state

          • Model state

        • Execute PTC transformations in parallel with minimum data movement between workers.

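      A rough NumPy sketch of PTC-style re-partitioning; the function names are assumptions, and a real implementation would move only the shards that change owner instead of materializing the full tensor:

      ```python
      import numpy as np

      def shard(tensor, parts, axis=0):
          return np.array_split(tensor, parts, axis=axis)

      def repartition(shards, new_parts, axis=0):
          """Re-slice a sharded tensor from len(shards)-way to new_parts-way."""
          full = np.concatenate(shards, axis=axis)  # logical view of the PTC entry
          return shard(full, new_parts, axis=axis)

      weights = np.arange(16.0).reshape(8, 2)
      two_way = shard(weights, 2)               # job running on 2 workers
      four_way = repartition(two_way, 4)        # scale out: now 4 workers
      assert np.allclose(np.concatenate(four_way), weights)
      print([s.shape for s in four_way])        # [(2, 2), (2, 2), (2, 2), (2, 2)]
      ```
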
    • Reducing Energy Bloat in Large Model Training [Paper] [Code] [arXiv]

      • UMich

      • Perseus: use a graph cut-based algorithm to obtain the "iteration time-energy" Pareto frontier; schedule the energy consumption across time.

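      An illustrative brute-force version of the planning step: keep the non-dominated (time, energy) plans and pick the cheapest one that still meets the iteration deadline. Perseus derives the frontier with its graph cut-based algorithm; this filter only shows what the frontier means:

      ```python
      def pareto_frontier(plans):
          """plans: list of (time_s, energy_j). Keep non-dominated points."""
          frontier = []
          for t, e in sorted(plans):            # sort by time, then energy
              if not frontier or e < frontier[-1][1]:
                  frontier.append((t, e))
              # else: dominated by an equal-or-faster, cheaper plan
          return frontier

      def choose(frontier, deadline_s):
          feasible = [(t, e) for t, e in frontier if t <= deadline_s]
          return min(feasible, key=lambda p: p[1]) if feasible else None

      plans = [(1.0, 950.0), (1.1, 800.0), (1.1, 820.0), (1.3, 700.0), (1.2, 760.0)]
      front = pareto_frontier(plans)
      print(front, choose(front, deadline_s=1.25))
      ```
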
  • LLM Inference

    • LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [Paper] [Code] [arXiv]

      • PKU

      • ESP: Elastic Sequence Parallelism

      • Elastically adjust the degree of parallelism in real-time; reduce key-value cache migration overhead and overlap partial decoding communication with computation; reduce key-value cache fragmentation across instances.

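      A toy illustration of elastic scaling across request phases, with invented thresholds: long-context prefill gets more sequence-parallel instances, and decoding scales down. The real policy also plans KV cache migration and fragmentation costs:

      ```python
      def prefill_dop(context_len, max_instances=8):
          # More tokens -> more sequence-parallel instances (capped).
          return min(max_instances, max(1, context_len // 32_000 + 1))

      def decode_dop(prefill_instances):
          # Scale down after prefill: decoding needs less compute per step.
          return max(1, prefill_instances // 2)

      for ctx in (4_000, 100_000, 500_000):
          p = prefill_dop(ctx)
          print(f"ctx={ctx}: prefill on {p} instance(s), decode on {decode_dop(p)}")
      ```
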
    • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [Paper] [Code] [arXiv]

      • SJTU IPADS

      • GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU.

      • Integrate adaptive predictors and neuron-aware sparse operators.

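      A minimal NumPy sketch of the hot/cold split for one FFN layer, simulating both devices on the CPU; the 20% hot ratio and the random "predictor" are stand-ins for the paper's profiled neuron statistics and trained activation predictors:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      d, n = 64, 256                        # hidden dim, number of FFN neurons
      W = rng.standard_normal((n, d))

      # Offline profiling: the most frequently activated 20% of neurons are
      # "hot" (preloaded into GPU memory); the rest stay on the CPU side.
      freq = rng.random(n)                  # stand-in activation frequencies
      hot = np.sort(np.argsort(freq)[-n // 5:])
      cold = np.setdiff1d(np.arange(n), hot)

      def forward(x, predicted_active):
          """Compute only neurons the predictor marks active, on their device."""
          out = np.zeros(n)
          act_hot = np.intersect1d(hot, predicted_active)
          act_cold = np.intersect1d(cold, predicted_active)
          out[act_hot] = W[act_hot] @ x     # hot rows: would run on the GPU
          out[act_cold] = W[act_cold] @ x   # cold rows: computed on the CPU
          return np.maximum(out, 0.0)       # ReLU preserves activation sparsity

      x = rng.standard_normal(d)
      active = rng.choice(n, size=n // 4, replace=False)
      y = forward(x, active)
      print(f"{np.count_nonzero(y)} of {n} neurons produced nonzero output")
      ```
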
Model Serving

  • Improving DNN Inference Throughput using Practical, Per-Input Compute Adaptation [Paper]

    • GaTech & Princeton & Stanford

    • E^3 (Efficient Early-Exits): trade off accuracy and resource costs at a per-input granularity via early-exit models.

    • Key insight: split and replicate blocks of layers in models; maintain a constant batch size throughout execution.

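    A small sketch of the constant-batch idea, assuming per-block input queues and a random stand-in for the exit check: each block batches whichever inputs reached it, so exits upstream do not shrink downstream batches (a real server would also pad or wait on tail batches):

    ```python
    import random
    from collections import deque

    random.seed(0)
    NUM_BLOCKS, BATCH = 3, 4
    queues = [deque() for _ in range(NUM_BLOCKS)]  # inputs waiting per block
    queues[0].extend(range(12))

    def exits_early(x, block):            # stand-in for an exit head's check
        return random.random() > 0.7

    done = []
    while any(queues):
        for b in range(NUM_BLOCKS):
            q = queues[b]
            if not q:
                continue
            # Each replicated block batches the inputs that reached it, so
            # the batch size stays constant even as other inputs exit early.
            batch = [q.popleft() for _ in range(min(BATCH, len(q)))]
            for x in batch:
                if b == NUM_BLOCKS - 1 or exits_early(x, b):
                    done.append((x, b))   # result produced at block b
                else:
                    queues[b + 1].append(x)
    print(sorted(done))
    ```
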
  • Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [Paper] [Code] [arXiv]

    • Princeton & GaTech

    • Automatically apply and manage early exits (certain inputs can exit with results at intermediate layers) in ML models.

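    A hedged sketch of the runtime adaptation loop: exit "ramps" carry confidence thresholds that are re-tuned as accuracy is observed. The proportional update rule below is invented for illustration, not Apparate's actual policy:

    ```python
    class Ramp:
        """An exit head attached at an intermediate layer."""
        def __init__(self, layer, threshold=0.9):
            self.layer, self.threshold = layer, threshold

    def adapt(ramp, observed_accuracy, target_accuracy=0.99, step=0.01):
        """Raise the threshold if exits hurt accuracy; lower it given slack."""
        if observed_accuracy < target_accuracy:
            ramp.threshold = min(1.0, ramp.threshold + step)  # exit less often
        else:
            ramp.threshold = max(0.5, ramp.threshold - step)  # exit more often
        return ramp.threshold

    ramp = Ramp(layer=6)
    for acc in (0.995, 0.992, 0.988, 0.991):  # accuracy measured per window
        print(adapt(ramp, acc))
    ```
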
ML Compilation

  • Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor [Paper] [arXiv]

    • UIUC & MSRA

    • T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips (e.g., the Graphcore IPU).

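    A pure-NumPy toy of the compute-shift pattern this hardware enables: each core keeps a weight row-block resident, computes with the activation shard it currently holds, then shifts shards to a neighbor over the inter-core links instead of going through shared memory. Everything here is a simulation, not T10's compiler output:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    P, k = 4, 3                           # cores, shard width
    W = rng.standard_normal((P * k, P * k))
    x = rng.standard_normal(P * k)

    # Each core keeps one row-block of W resident in local memory and starts
    # out holding one shard of x.
    w_rows = [W[c * k:(c + 1) * k] for c in range(P)]
    held = [x[c * k:(c + 1) * k] for c in range(P)]
    ids = list(range(P))                  # which x shard each core holds
    y = [np.zeros(k) for _ in range(P)]

    for _ in range(P):
        for c in range(P):                # compute with the locally held shard
            j = ids[c]
            y[c] += w_rows[c][:, j * k:(j + 1) * k] @ held[c]
        held = held[-1:] + held[:-1]      # shift shards over inter-core links
        ids = ids[-1:] + ids[:-1]

    assert np.allclose(np.concatenate(y), W @ x)
    print("distributed result matches W @ x")
    ```
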
  • SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference [Paper]

    • Indian Institute of Science & MSR

    • Search over several optimization choices and automatically generate high-performance inference routines for CPUs and GPUs.

    • Two components: (1) a scheduling language that encapsulates the optimization space; (2) an optimizing retargetable compiler that can generate code for any specified schedule.

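    A loose sketch of what a schedule can express, with invented fields: the same forest traversal runs under different loop orders and row tiles, and the compiler would lower whichever point in the space the schedule picks. SilvanForge's actual scheduling language is richer than this:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Schedule:
        loop_order: str = "rows_outer"    # or "trees_outer"
        row_tile: int = 4                 # rows processed together

    def eval_tree(tree, x):
        node = 0
        while True:
            feat, thresh, left, right, leaf = tree[node]
            if leaf is not None:
                return leaf
            node = left if x[feat] <= thresh else right

    def predict(forest, rows, sched):
        out = [0.0] * len(rows)
        if sched.loop_order == "trees_outer":
            for tree in forest:
                for i, x in enumerate(rows):
                    out[i] += eval_tree(tree, x)
        else:  # rows_outer, tiled: a different locality trade-off
            for base in range(0, len(rows), sched.row_tile):
                for i in range(base, min(base + sched.row_tile, len(rows))):
                    out[i] = sum(eval_tree(t, rows[i]) for t in forest)
        return out

    # Node layout: (feature, threshold, left, right, leaf_value or None)
    tree = [(0, 0.5, 1, 2, None), (None, None, None, None, -1.0),
            (None, None, None, None, +1.0)]
    print(predict([tree, tree], [[0.2], [0.9]], Schedule()))
    ```
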
Serverless Computing

  • Dirigent: Lightweight Serverless Orchestration [Paper] [Code] [arXiv]

    • ETH

    • Simplify cluster state management relative to existing orchestration systems (e.g., Kubernetes); eliminate persistent state updates; run monolithic control and data planes to minimize internal communication overheads.

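    A minimal sketch of the "no persistent state updates" idea, with invented classes and a toy rebuild protocol: placement state lives only in memory and is reconstructed from node reports after a control-plane restart, rather than being written to a datastore on every update:

    ```python
    class ControlPlane:
        def __init__(self):
            self.instances = {}          # function -> set of node ids (in memory)

        def place(self, function, node):
            self.instances.setdefault(function, set()).add(node)
            # Note: no write to a persistent store here; this state is
            # reconstructible from the nodes themselves.

        def recover(self, node_reports):
            """After a restart, rebuild state from per-node reports."""
            self.instances = {}
            for node, funcs in node_reports.items():
                for f in funcs:
                    self.place(f, node)

    cp = ControlPlane()
    cp.place("resize-image", "node-1")
    cp.place("resize-image", "node-2")

    cp2 = ControlPlane()                 # restarted control plane
    cp2.recover({"node-1": ["resize-image"], "node-2": ["resize-image"]})
    assert cp2.instances == cp.instances
    ```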