GPU Sharing
Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
ETH
Intercept GPU kernel launches and schedule individual GPU operators
Utilize CUDA stream priorities; consider the PCIe bandwidth
Use NVIDIA Nsight Compute and NVIDIA Nsight Systems to collect the compute throughput, memory throughput, and execution time of each kernel
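Orion's scheduler sits in the kernel-launch path inside the framework; the underlying CUDA primitive is stream priorities. A minimal sketch of that primitive only (kernels and sizes are illustrative, not Orion's code):

```cuda
#include <cuda_runtime.h>

// Stand-ins for a latency-critical inference op and a best-effort training op.
__global__ void inference_op(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}
__global__ void training_op(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.99f;
}

int main() {
    // Numerically lower values mean higher priority (typically 0 .. -5).
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t hi_prio, lo_prio;
    cudaStreamCreateWithPriority(&hi_prio, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&lo_prio, cudaStreamNonBlocking, least);

    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // The hardware scheduler prefers blocks from the high-priority stream
    // whenever both streams have work queued.
    training_op<<<n / 256, 256, 0, lo_prio>>>(b, n);
    inference_op<<<n / 256, 256, 0, hi_prio>>>(a, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b);
    cudaStreamDestroy(hi_prio); cudaStreamDestroy(lo_prio);
    return 0;
}
```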
Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach
UMacau & SIAT, CAS
IADeep: a cluster scheduler on top of Kubernetes
Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; consider PCIe bandwidth
Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
PKU
TGS: Transparent GPU sharing; adaptive rate control; unified memory.
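The unified-memory half of TGS relies on a standard CUDA property: a managed allocation can exceed free device memory because the driver pages data in and out on demand. A minimal sketch, with sizes chosen only for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    // A managed allocation survives GPU-memory pressure by spilling to host
    // RAM; this is what lets an opportunistic job keep running instead of
    // failing with OOM when the production job grows.
    size_t n = (size_t)1 << 28;  // ~1 GiB of floats
    float *x;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    touch<<<(int)((n + 255) / 256), 256>>>(x, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```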
Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
SJTU
REEF: GPU kernel preemption; dynamic kernel padding.
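REEF implements reset-based preemption inside the GPU runtime itself; a common software approximation of kernel preemption is a host-settable flag that the best-effort kernel polls. A generic sketch of that approximation (not REEF's mechanism):

```cuda
#include <cuda_runtime.h>

// Best-effort kernel that polls a host-settable flag and drains early when
// preemption is requested.
__global__ void best_effort(float *x, size_t n, volatile int *preempt) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += stride) {
        if (*preempt) return;  // yield the SMs to latency-critical work
        x[i] = x[i] * 0.5f + 1.0f;
    }
}

int main() {
    size_t n = (size_t)1 << 26;
    float *x;
    int *flag, *d_flag;
    cudaMalloc(&x, n * sizeof(float));
    cudaHostAlloc(&flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_flag, flag, 0);
    *flag = 0;

    best_effort<<<128, 256>>>(x, n, d_flag);
    *flag = 1;               // host requests preemption mid-flight
    cudaDeviceSynchronize(); // the kernel exits within a few iterations

    cudaFreeHost(flag);
    cudaFree(x);
    return 0;
}
```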
Gemini: Enabling Multi-Tenant GPU Sharing Based on Kernel Burst Estimation
National Tsing Hua University
Enable fine-grained GPU allocation; launch kernels together.
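A burst-estimation scheduler needs the length of each tenant's kernel burst (a run of back-to-back launches) without synchronizing after every kernel; CUDA events give exactly that. A minimal sketch of the measurement, not Gemini's code:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 1.0001f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the whole burst with events; only one sync at the end.
    cudaEventRecord(start);
    for (int k = 0; k < 16; ++k)  // one "kernel burst"
        work<<<n / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("burst length: %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```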
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
Alibaba
Enable GPU sharing in DL frameworks (TensorFlow/PyTorch); schedule operators.
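AntMan's operator scheduling lives inside the TensorFlow/PyTorch executors; the effect can be approximated from user code by pacing the opportunistic job's launches. A toy sketch (the fixed 2 ms delay stands in for AntMan's adaptive policy):

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

__global__ void op(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    // Pace operator launches so a co-located resource-guaranteed job sees
    // the GPU mostly free; AntMan adapts the delay instead of fixing it.
    for (int step = 0; step < 100; ++step) {
        op<<<n / 256, 256>>>(x, n);
        cudaStreamSynchronize(0);
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
    cudaFree(x);
    return 0;
}
```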
KubeShare: A Framework to Manage GPUs as First-Class and Shared Resources in Container Cloud
National Tsing Hua University
Kubernetes; CUDA API remoting.
GaiaGPU: Sharing GPUs in Container Clouds
PKU & Tencent
Kubernetes; CUDA API remoting.
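Both systems interpose the CUDA API between the container and the device. A minimal LD_PRELOAD-style shim for one call, quota logic elided; this only catches applications that link the shared CUDA runtime (nvcc -cudart shared):

```cuda
// Build: nvcc -Xcompiler -fPIC -shared hook.cu -o libhook.so
// Run:   LD_PRELOAD=./libhook.so ./app   (app built with -cudart shared)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <cstdio>
#include <cuda_runtime.h>

// Interposed cudaLaunchKernel: a vGPU layer can count, delay, or forward
// launches here to enforce a per-container quota before handing the call
// to the real runtime.
extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                        dim3 blockDim, void **args,
                                        size_t sharedMem, cudaStream_t stream) {
    using launch_fn = cudaError_t (*)(const void *, dim3, dim3, void **,
                                      size_t, cudaStream_t);
    static launch_fn real = (launch_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // Quota/accounting hook would go here.
    fprintf(stderr, "intercepted launch: grid=(%u,%u,%u)\n",
            gridDim.x, gridDim.y, gridDim.z);
    return real(func, gridDim, blockDim, args, sharedMem, stream);
}
```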
Paella: Low-latency Model Serving with Software-defined GPU Scheduling
UPenn & DBOS, Inc.
Instrument kernels to expose, at runtime, detailed information about the occupancy and utilization of the GPU's SMs
Compiler-library-scheduler co-design
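A stripped-down version of that instrumentation: have each block record the SM it landed on via the %smid special register (Paella's compiler pass injects richer telemetry automatically):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Read the ID of the SM executing this block.
__device__ unsigned smid() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void instrumented(float *x, int n, unsigned *sm_of_block) {
    if (threadIdx.x == 0)
        sm_of_block[blockIdx.x] = smid();  // expose placement to the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 16, blocks = n / 256;
    float *x;
    unsigned *sm;
    cudaMalloc(&x, n * sizeof(float));
    cudaMallocManaged(&sm, blocks * sizeof(unsigned));

    instrumented<<<blocks, 256>>>(x, n, sm);
    cudaDeviceSynchronize();

    for (int b = 0; b < 8; ++b)  // peek at the first few blocks
        printf("block %d ran on SM %u\n", b, sm[b]);
    cudaFree(x); cudaFree(sm);
    return 0;
}
```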
MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters (arXiv 2303.13803)
PKU & ByteDance
Utilize NVIDIA MPS
NVIDIA Multi-Instance GPU (MIG)
Partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.
Available for NVIDIA H100, A100, and A30 GPUs.
NVIDIA Multi-Process Service (MPS)
Transparently enable co-operative multi-process CUDA applications.
Terminating an MPS client without synchronizing with all outstanding GPU work (via Ctrl-C / program exception such as segfault / signals, etc.) can leave the MPS server and other MPS clients in an undefined state, which may result in hangs, unexpected failures, or corruptions.
NVIDIA CUDA Multi-Stream
Stream: a sequence of operations that execute in issue-order on the GPU.
Perform multiple CUDA operations simultaneously.
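The canonical pattern: split the data into chunks and give each chunk its own stream, so copies and kernels from different chunks overlap while each chunk's operations still run in issue order. A minimal sketch:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[chunks];
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&s[i]);

    // Each chunk's H2D copy, kernel, and D2H copy go to its own stream;
    // work from different streams overlaps on the copy and compute engines.
    for (int i = 0; i < chunks; ++i) {
        float *hp = h + (size_t)i * chunk, *dp = d + (size_t)i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<chunk / 256, 256, 0, s[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```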
GPU Virtualization and Scheduling Methods: A Comprehensive Survey
Queen’s University Belfast