Orion: Interference-aware, fine-grained GPU sharing for ML applications

Meta Info

Presented at EuroSys 2024.

Understanding the paper

  • Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU

  • It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements

  • Integrated into PyTorch
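The core policy above (co-schedule operators only when they stress different resources) can be sketched in a few lines of Python. This is a hypothetical illustration, not Orion's actual code; the names `Op` and `can_colocate` are mine:

```python
# Hypothetical sketch of interference-aware co-scheduling: each operator
# is tagged compute- or memory-bound from its profiled throughputs, and a
# best-effort (BE) op may run alongside the high-priority (HP) op only
# when their bottleneck resources differ.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    compute_util: float  # profiled SM throughput, 0..1
    memory_util: float   # profiled DRAM throughput, 0..1

    @property
    def bottleneck(self) -> str:
        return "compute" if self.compute_util >= self.memory_util else "memory"

def can_colocate(hp: Op, be: Op) -> bool:
    """Allow co-scheduling only when the two ops bottleneck on
    different resources, so they interfere less."""
    return hp.bottleneck != be.bottleneck

matmul = Op("matmul", compute_util=0.9, memory_util=0.3)     # compute-bound
embed = Op("embedding", compute_util=0.2, memory_util=0.8)   # memory-bound
conv = Op("conv", compute_util=0.85, memory_util=0.4)        # compute-bound

print(can_colocate(matmul, embed))  # True: different bottlenecks
print(can_colocate(matmul, conv))   # False: both compute-bound
```

The real system additionally bounds how much best-effort work is outstanding at once; this sketch only captures the bottleneck-matching rule.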

Technical details

  • Influences the hardware scheduler's behavior through CUDA stream priorities

  • Uses CUDA events to monitor the progress of each stream on the GPU

  • Schedules each cudaMemcpy operation based on its PCIe bandwidth requirement and the current bus utilization

  • Uses NVIDIA Nsight Compute and NVIDIA Nsight Systems to collect each kernel's compute throughput, memory throughput, and execution time
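The bandwidth-aware memcpy scheduling above can be sketched as an admission check: issue a copy only if its bandwidth demand fits in the remaining PCIe headroom, otherwise defer it. A minimal sketch, assuming a fixed bus capacity and using my own names (`MemcpyScheduler` is not from the paper):

```python
# Hypothetical sketch of PCIe-bandwidth-aware memcpy admission, not
# Orion's implementation: track committed bus bandwidth and defer copies
# that would oversubscribe the link.
from collections import deque

PCIE_BW_GBPS = 16.0  # assumed effective PCIe 3.0 x16 bandwidth

class MemcpyScheduler:
    def __init__(self, capacity_gbps: float = PCIE_BW_GBPS):
        self.capacity = capacity_gbps
        self.in_flight = 0.0    # GB/s currently committed to the bus
        self.pending = deque()  # deferred copies (their demands, GB/s)

    def request(self, demand_gbps: float) -> bool:
        """Admit a copy needing demand_gbps of bus bandwidth;
        return True if it can start now, False if deferred."""
        if self.in_flight + demand_gbps <= self.capacity:
            self.in_flight += demand_gbps
            return True
        self.pending.append(demand_gbps)
        return False

    def finish(self, demand_gbps: float) -> None:
        """Release bandwidth and drain deferred copies that now fit."""
        self.in_flight -= demand_gbps
        while self.pending and self.in_flight + self.pending[0] <= self.capacity:
            self.in_flight += self.pending.popleft()

sched = MemcpyScheduler()
print(sched.request(12.0))  # True: 12 <= 16, copy starts
print(sched.request(8.0))   # False: 12 + 8 > 16, copy deferred
sched.finish(12.0)          # frees headroom; the deferred copy starts
print(sched.in_flight)      # 8.0
```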

Evaluation

  • Baselines

    • Temporal sharing — time-slice the GPU by executing one job’s request at a time

    • NVIDIA MPS

    • CUDA Streams
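The temporal-sharing baseline can be sketched as simple round-robin time-slicing (hypothetical names; one request occupies the GPU at a time, so there is no interference but also no overlap):

```python
# Hypothetical sketch of the temporal-sharing baseline: cycle through
# the jobs, running one request to completion before the next one gets
# the GPU.
from collections import deque

def temporal_share(jobs: dict[str, list[str]]) -> list[str]:
    queues = {name: deque(reqs) for name, reqs in jobs.items()}
    order = []
    while any(queues.values()):
        for name, q in queues.items():
            if q:
                order.append(f"{name}:{q.popleft()}")  # runs alone on the GPU
    return order

print(temporal_share({"A": ["r0", "r1"], "B": ["r0"]}))
# -> ['A:r0', 'B:r0', 'A:r1']
```

MPS and plain CUDA streams sit at the other extreme: they overlap work from different clients but without any interference awareness, which is the gap Orion targets.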
