NSDI 2024
Meta Info
Homepage: https://www.usenix.org/conference/nsdi24
Paper list: https://www.usenix.org/conference/nsdi24/technical-sessions
Papers
Resource Management
Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer [Paper]
Google
Experience in designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale.
Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices [Paper] [Slides] [Code]
USTC & ETH & MSR
Minimize the CPU allocation of microservice applications while still meeting their SLOs.
Bi-level design: service-level control (low overhead, fast reaction) vs. application-level control (global visibility).
Captains (service-level): control based on throttle ratio target; collect data every 100ms, adjust allocation every 1s.
Tower (application-level): determine the best throttle targets for Captains to achieve; online learning (contextual bandit algorithm); one step per minute, each step runs in ~100ms.
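To make the Captain loop above concrete, here is a minimal sketch of a service-level controller that samples a throttle ratio every 100 ms and adjusts the CPU allocation once per second. It is not the paper's controller: the `Captain` class, the multiplicative `step_up` / `step_down` factors, and the `min_quota` floor are all hypothetical choices made for illustration. The throttle ratio is assumed here to be the fraction of scheduler periods in which the service was throttled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Captain:
    """Hypothetical service-level controller driven by a throttle-ratio target."""
    target_ratio: float      # throttle-ratio target handed down by the Tower
    cpu_quota: float         # current CPU allocation, in cores
    step_up: float = 1.5     # assumed growth factor when throttled too often
    step_down: float = 0.9   # assumed shrink factor when there is headroom
    min_quota: float = 0.1   # assumed lower bound on the allocation
    samples: List[float] = field(default_factory=list)

    def observe(self, throttled_periods: int, total_periods: int) -> None:
        """Record one sample (intended to be called every 100 ms)."""
        ratio = throttled_periods / total_periods if total_periods else 0.0
        self.samples.append(ratio)

    def adjust(self) -> float:
        """Update the allocation (intended to be called every 1 s)."""
        if not self.samples:
            return self.cpu_quota
        mean_ratio = sum(self.samples) / len(self.samples)
        self.samples.clear()
        if mean_ratio > self.target_ratio:
            # Throttled more than the target allows: give the service more CPU.
            self.cpu_quota *= self.step_up
        else:
            # At or below the target: reclaim CPU, but never below the floor.
            self.cpu_quota = max(self.min_quota, self.cpu_quota * self.step_down)
        return self.cpu_quota


captain = Captain(target_ratio=0.1, cpu_quota=2.0)
for _ in range(10):                 # ten 100 ms samples, heavily throttled
    captain.observe(5, 10)
quota_after_pressure = captain.adjust()
for _ in range(10):                 # ten 100 ms samples, no throttling
    captain.observe(0, 10)
quota_after_relief = captain.adjust()
```

In this sketch the Tower's only job would be to pick `target_ratio` for each Captain; the online-learning part is left out.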
CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters [Paper]
MIT & UT-Austin
Consider the communication patterns of different jobs when placing them on shared network links.
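One way to picture "considering communication patterns" is to check how well two jobs' periodic traffic can interleave on a shared link. The sketch below is an illustration under strong assumptions, not CASSINI's algorithm: both jobs are assumed to have the same iteration length, their demand is given as a per-slot bandwidth list, and the hypothetical `best_time_shift` helper simply tries every circular shift of the second job and keeps the one with the least demand above link capacity.

```python
from typing import List, Tuple


def best_time_shift(demand_a: List[float],
                    demand_b: List[float],
                    capacity: float) -> Tuple[int, float]:
    """Return (shift, excess): delaying job B by `shift` slots minimizes the
    total per-iteration demand that exceeds the link capacity."""
    n = len(demand_a)
    if n == 0 or n != len(demand_b):
        raise ValueError("both jobs need the same, non-zero number of slots")
    best_shift, best_excess = 0, float("inf")
    for shift in range(n):
        # Demand of job B in slot i after delaying it by `shift` slots.
        excess = sum(
            max(0.0, demand_a[i] + demand_b[(i - shift) % n] - capacity)
            for i in range(n)
        )
        if excess < best_excess:
            best_shift, best_excess = shift, excess
    return best_shift, best_excess


# Two jobs that each communicate during the first half of an iteration.
job_a = [10.0, 10.0, 0.0, 0.0]
job_b = [10.0, 10.0, 0.0, 0.0]
shift, excess = best_time_shift(job_a, job_b, capacity=10.0)
```

A scheduler could use such a score to prefer placements where the remaining excess is small, and avoid co-locating jobs whose communication phases cannot be interleaved.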
Large Language Models (LLMs)
Utilize Spot Instances
Can't Be Late: Optimizing Spot Instance Savings under Deadlines [Paper] [Trace]
UC Berkeley
Outstanding Paper
Characterization (e.g., availability, pricing, duration) of three-month-long spot availability traces on AWS.
Uniform Progress: a policy that keeps the job making steady progress toward its deadline by spreading the computation evenly over the time available.
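The idea of spreading work evenly can be sketched as a per-slot decision rule. The version below is a simplification written for these notes, not the paper's exact policy: time is split into equal slots, one slot on any instance completes one unit of work, switching between instance types is assumed to be free, and the `uniform_progress_schedule` function and its labels are hypothetical. Spot is used whenever it is available; on-demand is used only when skipping the slot would put the job behind the uniform-progress line.

```python
from typing import List


def uniform_progress_schedule(spot_available: List[bool],
                              work_needed: int) -> List[str]:
    """Choose 'spot', 'on_demand', or 'idle' for each slot before the deadline.

    len(spot_available) is the deadline in slots; work_needed must not exceed it.
    """
    deadline = len(spot_available)
    if work_needed > deadline:
        raise ValueError("job cannot finish before the deadline")
    done = 0
    plan: List[str] = []
    for t, has_spot in enumerate(spot_available):
        if done >= work_needed:
            plan.append("idle")          # job already finished
            continue
        # Progress required by the end of slot t if work is spread evenly.
        expected = work_needed * (t + 1) / deadline
        if has_spot:
            plan.append("spot")          # cheap capacity: always take it
            done += 1
        elif done < expected:
            plan.append("on_demand")     # would fall behind: pay for on-demand
            done += 1
        else:
            plan.append("idle")          # ahead of the line: wait for spot
    return plan


availability = [True, True, False, False, False, True, False, False]
plan = uniform_progress_schedule(availability, work_needed=4)
# plan == ['spot', 'spot', 'idle', 'idle', 'on_demand', 'spot', 'idle', 'idle']
```

Because the job is never allowed to end a slot behind the even-progress line, it finishes by the deadline in this model even if spot capacity never appears.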
Multimodal Models
DISTMM: Accelerating Distributed Multimodal Model Training [Paper]
Ohio State University & AWS
Partition and parallelize the submodules of a multimodal model based on their modalities and redistribute the training data.
Diffusion Models
Deep Learning Recommendation Models (DLRMs)
Fair Resource Allocation
Network Emulation
RDMA
Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds [Paper]
UIUC & Duke & Microsoft
Harmonic: microarchitecture-resource-aware RDMA performance isolation, built from a programmable intelligent PCIe switch (prototyped on an FPGA) and an RDMA-friendly rate limiter.
PCIe
Understanding Routable PCIe Performance for Composable Infrastructures [Paper]
UW-Madison & ZJU
rPCIeBench: a software-hardware co-designed benchmarking framework to systematically characterize the routable PCIe fabric.