Orion — a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU
It schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements
Integrated into PyTorch
Technical details
Influence the behavior of the hardware scheduler by using CUDA stream priorities
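A minimal sketch of the stream-priority mechanism (kernel and variable names are illustrative, not Orion's code): the latency-critical client's work goes to a high-priority stream, the best-effort client's to a low-priority one, and the hardware scheduler then favors the former when both have kernels pending.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel standing in for a client's operator
__global__ void dummyKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    // Query the device's supported priority range
    // (numerically lower value = higher priority)
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t hiStream, loStream;
    cudaStreamCreateWithPriority(&hiStream, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loStream, cudaStreamNonBlocking, leastPrio);

    int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    // Latency-critical client's kernels -> high-priority stream;
    // best-effort client's kernels -> low-priority stream
    dummyKernel<<<(n + 255) / 256, 256, 0, hiStream>>>(buf, n);
    dummyKernel<<<(n + 255) / 256, 256, 0, loStream>>>(buf, n);

    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaStreamDestroy(hiStream);
    cudaStreamDestroy(loStream);
    return 0;
}
```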
Use CUDA events to monitor the progress of each stream on the GPU
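The event-based progress tracking can be sketched as follows (names are illustrative): record an event after each submitted operator and poll it with cudaEventQuery, so the scheduler learns how far a stream has advanced without blocking on it.

```cuda
#include <cuda_runtime.h>

// Sketch, assuming `someKernel`, `grid`, `block`, and `stream` exist:
// an event with timing disabled is cheap and serves purely as a
// completion marker for everything submitted to the stream before it.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

someKernel<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(done, stream);

// Later, on the scheduler thread: non-blocking progress check.
if (cudaEventQuery(done) == cudaSuccess) {
    // The kernel (and all prior work in this stream) has finished,
    // so the next operator can be dispatched to this stream.
} else {
    // Still running (cudaErrorNotReady); revisit on the next poll.
}
```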
Schedule each cudaMemcpy operation by considering its PCIe bandwidth requirements and current bus bandwidth utilization
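One way such a bandwidth-aware memcpy gate could look (the admission policy, constants, and names here are assumptions for illustration, not Orion's exact algorithm): estimate the bandwidth a pending copy needs and admit it only if the PCIe bus has headroom, deferring it otherwise.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical request descriptor for a pending host-to-device copy
struct CopyRequest {
    void  *dst;
    void  *src;
    size_t bytes;
    double window_s;   // time window in which the copy should complete
};

const double PCIE_PEAK_GBPS = 16.0;  // illustrative PCIe 3.0 x16 figure
double busUtilizationGbps = 0.0;     // updated as copies start and finish

// Admit the copy only if its estimated bandwidth demand fits within
// the remaining bus capacity; otherwise defer it to a later poll.
bool tryAdmit(const CopyRequest &r, cudaStream_t stream) {
    double neededGbps = (r.bytes / 1e9) / r.window_s;
    if (busUtilizationGbps + neededGbps > PCIE_PEAK_GBPS)
        return false;  // bus saturated: keep the request queued
    busUtilizationGbps += neededGbps;
    cudaMemcpyAsync(r.dst, r.src, r.bytes, cudaMemcpyHostToDevice, stream);
    return true;       // caller decrements busUtilizationGbps on completion
}
```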