Orion: Interference-aware, fine-grained GPU sharing for ML applications
Meta Info
Presented in EuroSys 2024.
Understanding the paper
Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU
It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements
Integrated into PyTorch
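The interception idea above can be sketched in pure Python. This is a toy model, not Orion's actual implementation (which hooks CUDA kernel launches from inside PyTorch in C++): all class and method names here are hypothetical.

```python
from collections import deque

class InterceptingScheduler:
    """Toy model of Orion's interception layer: clients do not submit
    kernels to the GPU directly; each launch is captured into a per-client
    queue, and a central scheduler decides the submission order at the
    granularity of individual operators."""

    def __init__(self):
        self.queues = {}        # client id -> queue of pending operators
        self.submitted = []     # order in which operators reach the "GPU"

    def intercept_launch(self, client, op):
        # Transparent interception: the client believes it launched a
        # kernel, but the operator is only enqueued for the scheduler.
        self.queues.setdefault(client, deque()).append(op)

    def schedule_step(self):
        # Simplified policy: submit one pending operator per client.
        # (The real scheduler is interference-aware, using per-operator
        # compute and memory profiles.)
        for client, queue in self.queues.items():
            if queue:
                self.submitted.append((client, queue.popleft()))

sched = InterceptingScheduler()
sched.intercept_launch("hi-priority", "conv2d")
sched.intercept_launch("best-effort", "gemm")
sched.schedule_step()
print(sched.submitted)  # [('hi-priority', 'conv2d'), ('best-effort', 'gemm')]
```

The point of the sketch is the indirection: because every launch passes through the scheduler, it can reorder, delay, or colocate operators from different clients without any client-side code changes.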
Technical details
Influence the behavior of the hardware scheduler by using CUDA stream priorities
Use CUDA events to monitor the progress of each stream on the GPU
Schedule each cudaMemcpy operation by considering its PCIe bandwidth requirements and the current bus bandwidth utilization
Use NVIDIA Nsight Compute and NVIDIA Nsight Systems to collect the compute throughput, memory throughput, and execution time of each kernel
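The profiling-driven policy in the bullets above can be illustrated with a small sketch: classify each kernel as compute-bound or memory-bound from its profiled throughputs, and co-schedule a best-effort kernel with the high-priority job only when the two kernels bottleneck on different resources. The thresholds, names, and admission rule here are illustrative assumptions, not the paper's exact logic.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    # Throughputs collected offline with Nsight Compute / Nsight Systems.
    name: str
    compute_util: float   # fraction of peak SM throughput (0..1)
    memory_util: float    # fraction of peak DRAM throughput (0..1)

def bottleneck(k: KernelProfile) -> str:
    # Treat a kernel as bound by whichever resource it stresses more.
    return "compute" if k.compute_util >= k.memory_util else "memory"

def can_colocate(hi: KernelProfile, be: KernelProfile) -> bool:
    """Admit a best-effort kernel alongside the high-priority one only if
    the two are bound by different resources, so they contend less."""
    return bottleneck(hi) != bottleneck(be)

gemm = KernelProfile("gemm", compute_util=0.9, memory_util=0.3)  # compute-bound
copy = KernelProfile("copy", compute_util=0.1, memory_util=0.8)  # memory-bound
print(can_colocate(gemm, copy))  # True: different bottlenecks
print(can_colocate(gemm, gemm))  # False: both compute-bound
```

In the real system this admission decision is combined with CUDA stream priorities (the high-priority client runs on a high-priority stream) and CUDA events, which tell the scheduler how far each stream has progressed before it admits more best-effort work.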
Evaluation
Baselines
Temporal sharing — time-slice the GPU by executing one job’s request at a time
NVIDIA MPS
CUDA Streams