Orion: Interference-aware, fine-grained GPU sharing for ML applications
Presented at EuroSys 2024.
Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU.
It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking each operator's compute and memory requirements into account.
Integrated into PyTorch.
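The interference-aware colocation decision can be sketched in plain Python. This is a hypothetical simplification for illustration: the class and function names are invented here, and the two-way compute/memory split is coarser than the paper's actual policy.

```python
# Sketch (not Orion's real implementation): colocate a best-effort kernel
# with the high-priority job only if the two kernels stress different
# resources, so they contend less on the GPU.
from dataclasses import dataclass

@dataclass
class KernelProfile:
    name: str
    compute_util: float   # fraction of peak compute throughput (profiled offline)
    memory_util: float    # fraction of peak memory bandwidth (profiled offline)

    @property
    def bottleneck(self) -> str:
        # Classify the kernel by its dominant resource.
        return "compute" if self.compute_util >= self.memory_util else "memory"

def may_colocate(hp: KernelProfile, be: KernelProfile) -> bool:
    """Admit the best-effort kernel `be` alongside the high-priority
    kernel `hp` only when their bottlenecks differ."""
    return hp.bottleneck != be.bottleneck

gemm = KernelProfile("gemm", compute_util=0.9, memory_util=0.3)        # compute-bound
embed = KernelProfile("embedding", compute_util=0.1, memory_util=0.8)  # memory-bound
print(may_colocate(gemm, embed))  # → True: different bottlenecks
print(may_colocate(gemm, gemm))   # → False: both compute-bound
```

A real scheduler would also bound the best-effort kernel's duration so it cannot delay the high-priority stream for long.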
Influences the behavior of the hardware scheduler by using CUDA stream priorities.
Uses CUDA events to monitor the progress of each stream on the GPU.
Schedules each cudaMemcpy operation by considering its PCIe bandwidth requirements and the current bus-bandwidth utilization.
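The memcpy admission check can be sketched as a simple bandwidth-budget test. This is a hypothetical sketch: the function name and the peak PCIe figure are illustrative assumptions, not values from the paper.

```python
# Sketch (assumed admission rule): issue a cudaMemcpy only if its
# bandwidth demand fits within the PCIe bandwidth left over by the
# transfers already in flight.
PCIE_PEAK_GBPS = 16.0  # illustrative peak (roughly PCIe 3.0 x16)

def can_issue_memcpy(required_gbps: float,
                     current_util_gbps: float,
                     peak_gbps: float = PCIE_PEAK_GBPS) -> bool:
    """Admit the copy only if required bandwidth fits in the headroom."""
    return required_gbps <= peak_gbps - current_util_gbps

print(can_issue_memcpy(4.0, 10.0))  # → True: 4 GB/s fits in 6 GB/s headroom
print(can_issue_memcpy(8.0, 10.0))  # → False: would oversubscribe the bus
```

A copy that fails the check would be queued and retried once in-flight transfers complete.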
Uses profiling tools to collect the compute throughput, memory throughput, and execution time of each kernel.
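The profiled throughputs become useful to the scheduler once normalized against the GPU's peak capabilities. A minimal sketch, with invented function name and illustrative peak numbers:

```python
# Sketch: turn raw profiled metrics into a compute-bound vs.
# memory-bound label by normalizing against assumed GPU peaks.
def classify_kernel(compute_tput: float, mem_tput: float,
                    peak_compute: float, peak_mem: float) -> str:
    """Label a kernel by its dominant resource, given profiled
    throughputs (FLOP/s, B/s) and the GPU's peak capabilities."""
    compute_util = compute_tput / peak_compute
    mem_util = mem_tput / peak_mem
    return "compute-bound" if compute_util >= mem_util else "memory-bound"

# e.g. a matmul achieving 12 TFLOP/s on a 14 TFLOP/s GPU while using
# 200 GB/s of a 900 GB/s memory system:
print(classify_kernel(12e12, 200e9, 14e12, 900e9))  # → compute-bound
```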
Baselines
Temporal sharing — time-slice the GPU by executing one job's requests at a time.
NVIDIA MPS — spatial sharing via the Multi-Process Service, which lets kernels from multiple processes run concurrently.
CUDA Streams — submit each client's work on a separate stream, leaving scheduling entirely to the hardware.