Orion: Interference-aware, fine-grained GPU sharing for ML applications
Presented at EuroSys 2024.
Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU.
It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking each operator's compute and memory requirements into account.
Integrated into PyTorch.
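The interference-aware colocation decision can be sketched in plain Python. This is a hypothetical simplification for illustration: the class and function names are invented here, and the two-way compute/memory split is coarser than the paper's actual policy.

```python
# Sketch (not Orion's real implementation): colocate a best-effort kernel
# with the high-priority job only if the two kernels stress different
# resources, so they contend less on the GPU.
from dataclasses import dataclass

@dataclass
class KernelProfile:
    name: str
    compute_util: float   # fraction of peak compute throughput (profiled offline)
    memory_util: float    # fraction of peak memory bandwidth (profiled offline)

    @property
    def bottleneck(self) -> str:
        # Classify the kernel by its dominant resource.
        return "compute" if self.compute_util >= self.memory_util else "memory"

def may_colocate(hp: KernelProfile, be: KernelProfile) -> bool:
    """Admit the best-effort kernel `be` alongside the high-priority
    kernel `hp` only when their bottlenecks differ."""
    return hp.bottleneck != be.bottleneck

gemm = KernelProfile("gemm", compute_util=0.9, memory_util=0.3)        # compute-bound
embed = KernelProfile("embedding", compute_util=0.1, memory_util=0.8)  # memory-bound
print(may_colocate(gemm, embed))  # → True: different bottlenecks
print(may_colocate(gemm, gemm))   # → False: both compute-bound
```

A real scheduler would also bound the best-effort kernel's duration so it cannot delay the high-priority stream for long.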
Influences the behavior of the hardware scheduler by using CUDA stream priorities.
Uses CUDA events to monitor the progress of each stream on the GPU.
Schedules each cudaMemcpy operation by considering its PCIe bandwidth requirements and the current bus-bandwidth utilization.
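The memcpy admission check can be sketched as a simple bandwidth-budget test. This is a hypothetical sketch: the function name and the peak PCIe figure are illustrative assumptions, not values from the paper.

```python
# Sketch (assumed admission rule): issue a cudaMemcpy only if its
# bandwidth demand fits within the PCIe bandwidth left over by the
# transfers already in flight.
PCIE_PEAK_GBPS = 16.0  # illustrative peak (roughly PCIe 3.0 x16)

def can_issue_memcpy(required_gbps: float,
                     current_util_gbps: float,
                     peak_gbps: float = PCIE_PEAK_GBPS) -> bool:
    """Admit the copy only if required bandwidth fits in the headroom."""
    return required_gbps <= peak_gbps - current_util_gbps

print(can_issue_memcpy(4.0, 10.0))  # → True: 4 GB/s fits in 6 GB/s headroom
print(can_issue_memcpy(8.0, 10.0))  # → False: would oversubscribe the bus
```

A copy that fails the check would be queued and retried once in-flight transfers complete.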
Uses profiling tools to collect the compute throughput, memory throughput, and execution time of each kernel.
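The profiled throughputs become useful to the scheduler once normalized against the GPU's peak capabilities. A minimal sketch, with invented function name and illustrative peak numbers:

```python
# Sketch: turn raw profiled metrics into a compute-bound vs.
# memory-bound label by normalizing against assumed GPU peaks.
def classify_kernel(compute_tput: float, mem_tput: float,
                    peak_compute: float, peak_mem: float) -> str:
    """Label a kernel by its dominant resource, given profiled
    throughputs (FLOP/s, B/s) and the GPU's peak capabilities."""
    compute_util = compute_tput / peak_compute
    mem_util = mem_tput / peak_mem
    return "compute-bound" if compute_util >= mem_util else "memory-bound"

# e.g. a matmul achieving 12 TFLOP/s on a 14 TFLOP/s GPU while using
# 200 GB/s of a 900 GB/s memory system:
print(classify_kernel(12e12, 200e9, 14e12, 900e9))  # → compute-bound
```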
Baselines
Temporal sharing — time-slice the GPU by executing one job's requests at a time.
NVIDIA MPS — spatial sharing via the Multi-Process Service, which lets kernels from multiple processes run concurrently.
CUDA Streams — submit each client's work on a separate stream, leaving scheduling entirely to the hardware.