Orion: Interference-aware, fine-grained GPU sharing for ML applications
Meta Info
Presented in EuroSys 2024.
Understanding the paper
- Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU 
- It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements 
- Integrated into PyTorch 
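The operator-level, interference-aware idea can be sketched in Python (names and thresholds here are illustrative assumptions, not Orion's actual implementation): classify each profiled kernel by its dominant resource, and co-schedule a best-effort kernel with the high-priority one only when their bottlenecks differ.

```python
from dataclasses import dataclass

# Hypothetical profile record; Orion collects similar metrics offline
# with NVIDIA's profiling tools.
@dataclass
class KernelProfile:
    name: str
    compute_util: float  # fraction of peak SM throughput (0..1)
    memory_util: float   # fraction of peak DRAM throughput (0..1)

def bottleneck(p: KernelProfile) -> str:
    """Classify a kernel by its dominant resource."""
    return "compute" if p.compute_util >= p.memory_util else "memory"

def may_colocate(hp: KernelProfile, be: KernelProfile) -> bool:
    """Admit a best-effort (be) kernel alongside the high-priority (hp)
    kernel only if their bottlenecks differ, so they contend less."""
    return bottleneck(hp) != bottleneck(be)

matmul = KernelProfile("matmul", compute_util=0.9, memory_util=0.3)
reduce_ = KernelProfile("reduce", compute_util=0.2, memory_util=0.8)
conv = KernelProfile("conv", compute_util=0.8, memory_util=0.4)

print(may_colocate(matmul, reduce_))  # True: compute-bound + memory-bound
print(may_colocate(matmul, conv))     # False: both compute-bound
```

This captures only the admission decision; the real system also tracks per-stream progress and issues work through prioritized CUDA streams.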
Technical details
- Influence the behavior of the hardware scheduler by using CUDA stream priorities 
- Use CUDA events to monitor the progress of each stream on the GPU 
- Schedule each cudaMemcpy operation by considering its PCIe bandwidth requirements and the current bus bandwidth utilization 
- Use NVIDIA Nsight Compute and NVIDIA Nsight Systems to collect the compute throughput, memory throughput, and execution time of each kernel 
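The PCIe-aware copy scheduling can be sketched as a simple admission check (a sketch with assumed numbers, not Orion's code): a best-effort cudaMemcpy is issued only if its estimated bandwidth demand fits within the bus's spare capacity.

```python
def can_issue_copy(copy_gbps: float, current_gbps: float,
                   pcie_capacity_gbps: float = 16.0) -> bool:
    """Admit a best-effort host<->device copy only when the PCIe bus has
    enough spare bandwidth; 16 GB/s approximates a PCIe 3.0 x16 link."""
    return current_gbps + copy_gbps <= pcie_capacity_gbps

print(can_issue_copy(4.0, current_gbps=10.0))  # True: 14 <= 16
print(can_issue_copy(8.0, current_gbps=10.0))  # False: 18 > 16
```

Copies that fail the check would be deferred until bus utilization drops, keeping the high-priority client's transfers unimpeded.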
Evaluation
- Baselines:
  - Temporal sharing — time-slice the GPU by executing one job's request at a time 
  - NVIDIA MPS 
  - CUDA Streams 
 