# NSDI 2024

## Meta Info

Homepage: <https://www.usenix.org/conference/nsdi24>

Paper list: <https://www.usenix.org/conference/nsdi24/technical-sessions>

## Papers

### Resource Management

* Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/zu)]
  * Google
  * Experience designing and operating the software infrastructure that keeps TPUv4 supercomputers running at scale.
* Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/wang-zibo)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-wang_zibo.pdf)] \[[Code](https://github.com/microsoft/autothrottle)]
  * USTC & ETH & MSR
  * Minimize the CPU allocation of *microservice applications* while meeting SLOs.
  * Two-level design: service-level control (low overhead & fast reaction) vs. application-level control (global visibility)
    * Captains (service-level): control based on throttle ratio target; collect data every 100ms, adjust allocation every 1s.
    * Tower (application-level): determine the best throttle targets for Captains to achieve; online learning (contextual bandit algorithm); one step per minute, each step runs in \~100ms.
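  The bi-level split above can be sketched as follows. This is a toy illustration, not the paper's code: the step sizes and the explore/exploit rule are my own assumptions, and the real Tower uses a contextual bandit rather than this placeholder.

  ```python
  def captain_step(alloc, observed_throttle, target_throttle, step=0.05):
      """Service-level Captain (runs every ~1s in the paper): nudge the CPU
      allocation so the observed throttle ratio tracks the target."""
      if observed_throttle > target_throttle:
          return alloc * (1 + step)        # throttled too much: grow allocation
      return max(0.1, alloc * (1 - step))  # throttled below target: shrink

  def tower_step(targets, slo_met):
      """Application-level Tower (runs every ~1min in the paper): pick new
      throttle targets for the Captains. The real Tower uses a contextual
      bandit; this is a crude explore/exploit stand-in."""
      if not slo_met:
          # SLO violated: lower targets so Captains hand out more CPU.
          return {svc: t * 0.5 for svc, t in targets.items()}
      # SLO met: raise targets slightly to reclaim CPU.
      return {svc: min(1.0, t * 1.1) for svc, t in targets.items()}
  ```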
* CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/rajasekaran)]
  * MIT & UT-Austin
  * Consider the communication patterns of different jobs when placing them on network links.
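The core compatibility idea can be sketched in a toy model of my own making (not the paper's formulation): two jobs with the same iteration period can share a link at full speed only if their communication bursts can be time-shifted so they never overlap.

```python
def compatible(period, comm_a, comm_b):
    """Two jobs with equal iteration period can share a link without their
    communication phases colliding iff both bursts fit within one period."""
    return comm_a + comm_b <= period

def interleave_offset(comm_a):
    """One feasible time shift: start job B's burst right after job A's."""
    return comm_a
```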

### Large Language Models (LLMs)

* LLM characterization
  * Characterization of Large Language Model Development in the Datacenter \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/hu)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-hu.pdf)] \[[Trace](https://github.com/InternLM/AcmeTrace)]
    * NTU & PKU & CUHK & Shanghai AI Lab
* LLM training
  * MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/jiang-ziheng)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-jiang_ziheng.pdf)] \[[Code](https://github.com/volcengine/veScale)]
    * ByteDance & PKU

### Spot Instances

* Can't Be Late: Optimizing Spot Instance Savings under Deadlines \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao)] \[[Trace](https://github.com/skypilot-org/spot-traces)]
  * UC Berkeley
  * **Outstanding Paper**
  * Characterization (e.g., availability, pricing, duration) of three-month-long spot availability traces on AWS.
  * **Uniform Progress**: a policy that makes uniform progress toward the deadline by distributing the job's computation uniformly over time.
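  The policy's intuition can be sketched roughly like this. This is my simplification, not the paper's exact algorithm; the per-slice decision granularity and all names are assumptions.

  ```python
  def uniform_progress_decision(elapsed, deadline, done, total, spot_available):
      """Pick the instance type for the next time slice so that completed
      work never falls behind a uniform schedule toward the deadline."""
      required = total * (elapsed / deadline)  # work a uniform schedule needs by now
      if spot_available:
          return "spot"       # cheap capacity keeps us on or ahead of schedule
      if done >= required:
          return "wait"       # ahead of the uniform line: pause and save cost
      return "on-demand"      # behind schedule with no spot: pay to catch up
  ```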
* Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/duan)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-duan.pdf)] \[[Code](https://github.com/JF-D/Parcae)]
  * CUHK & ByteDance & CMU & UCLA & Microsoft
  * Proactively adjust the parallelization strategy of a DNN training job for future preemptions to maximize preemption-aware throughput (i.e., liveput).
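  The liveput objective can be sketched as an expectation over preemption forecasts. A hedged toy model, not Parcae's implementation: the throughput tables and probabilities below are made-up stand-ins.

  ```python
  def liveput(throughput_by_loss, preempt_prob):
      """Expected throughput of a strategy: throughput_by_loss[k] is its
      throughput after losing k instances, preempt_prob[k] the forecast
      probability of losing exactly k instances."""
      return sum(p * t for p, t in zip(preempt_prob, throughput_by_loss))

  def pick_strategy(strategies, preempt_prob):
      """Choose the parallelization strategy with the highest liveput,
      not the highest current throughput."""
      return max(strategies, key=lambda s: liveput(strategies[s], preempt_prob))
  ```

  With a plausible preemption forecast, a strategy that is slower today but degrades gracefully can beat one that is faster but collapses after a preemption.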

### Multimodal Models

* DISTMM: Accelerating Distributed Multimodal Model Training \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/huang)]
  * Ohio State University & AWS
  * Partition and parallelize the submodules of a multimodal model based on their modalities and redistribute the training data.

### Diffusion Models

* Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/agarwal-shubham)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-agarwal_shubham.pdf)]
  * Adobe Research & UIUC
  * Approximate caching: skip a number of denoising steps by reusing intermediate noise states created during prior image generations for similar prompts.
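  A minimal sketch of the lookup, under my own assumptions (cosine similarity over prompt embeddings, a fixed number of skipped steps, and a fixed threshold are all illustrative choices, not the paper's):

  ```python
  def cosine(a, b):
      """Cosine similarity between two prompt-embedding vectors."""
      dot = sum(x * y for x, y in zip(a, b))
      na = sum(x * x for x in a) ** 0.5
      nb = sum(x * x for x in b) ** 0.5
      return dot / (na * nb)

  def plan_denoising(prompt_vec, cache, skip=25, threshold=0.95):
      """Return (start_step, initial_state). On a cache hit for a
      sufficiently similar prompt, resume from its stored intermediate
      noise state and skip the first `skip` denoising steps."""
      for vec, state in cache.items():
          if cosine(prompt_vec, vec) >= threshold:
              return skip, state   # hit: reuse the cached noise state
      return 0, None               # miss: run all denoising steps from scratch
  ```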

### Deep Learning Recommendation Models (DLRMs)

* Accelerating Neural Recommendation Training with Embedding Scheduling \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/zeng)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-zeng.pdf)] \[[Code](https://github.com/HKUST-SING/herald)]
  * HKUST
  * **Herald**: an adaptive location-aware input allocator to determine *where embeddings should be trained* and an optimal communication plan generator to determine *which embeddings should be synchronized*.
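  The placement half of the idea can be caricatured in a few lines (my own toy model, not Herald's algorithm): train each embedding's update on the worker that will access it most, so fewer embeddings need cross-worker synchronization.

  ```python
  def schedule_embeddings(access_counts):
      """access_counts[emb][worker] = how often that worker touches `emb`
      in the upcoming batches. Returns an emb -> worker placement that
      keeps each embedding's updates local to its heaviest user."""
      return {emb: max(counts, key=counts.get)
              for emb, counts in access_counts.items()}
  ```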

### Fair Resource Allocation

* Solving Max-Min Fair Resource Allocations Quickly on Large Graphs \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/namyar-solving)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides_namyar-solving.pdf)] \[[Code](https://github.com/microsoft/Soroush)]
  * Microsoft & USC & Rice
  * **Soroush**: Single-Shot Max-Min Fair Allocator.
  * Deployed on Microsoft WAN.
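  For context, the classic (slow) way to compute max-min fair rates on a single resource is progressive filling: repeatedly split the leftover capacity equally among unsatisfied demands. Soroush's contribution is computing near-max-min-fair allocations in a single shot on large graphs; the sketch below is only the textbook baseline, not the paper's method.

  ```python
  def max_min_fair(capacity, demands):
      """Progressive filling on one resource: each round, give every
      unsatisfied demand an equal share of the remaining capacity."""
      alloc = [0.0] * len(demands)
      active = [i for i, d in enumerate(demands) if d > 0]
      remaining = capacity
      while active and remaining > 1e-12:
          share = remaining / len(active)
          still_active = []
          for i in active:
              give = min(share, demands[i] - alloc[i])
              alloc[i] += give
              remaining -= give
              if alloc[i] < demands[i] - 1e-12:
                  still_active.append(i)   # demand not yet satisfied
          active = still_active
      return alloc
  ```

  For example, demands of 2, 4, and 10 on a capacity-10 link get allocations 2, 4, and 4: small demands are fully satisfied, and the rest is split fairly.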

### Network Emulation

* Crescent: Emulating Heterogeneous Production Network at Scale \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/gao-zhaoyu)] \[[Slides](https://www.usenix.org/system/files/nsdi24_slides-gao_zhaoyu.pdf)]
  * ByteDance & Cornell
  * **Crescent**: ByteDance’s *network emulation* platform for preventing *change-induced network incidents*.

### RDMA

* Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/lou)]
  * UIUC & Duke & Microsoft
  * **Harmonic**: microarchitecture-resource-aware RDMA performance isolation; including a programmable intelligent PCIe switch (prototyped with FPGA) and an RDMA-friendly rate limiter.

### PCIe

* Understanding Routable PCIe Performance for Composable Infrastructures \[[Paper](https://www.usenix.org/conference/nsdi24/presentation/hou)]
  * UW-Madison & ZJU
  * **rPCIeBench**: a software-hardware co-designed benchmarking framework to systematically characterize the *routable PCIe fabric*.
