# Orion: Interference-aware, fine-grained GPU sharing for ML applications

## Meta Info

Presented at [EuroSys 2024](https://anakli.inf.ethz.ch/papers/orion_eurosys24.pdf).

## Understanding the paper

* Orion is a system that transparently *intercepts GPU kernel launches* from multiple clients sharing a GPU
* It schedules work on the GPU *at the granularity of individual operators* and minimizes interference by taking *each operator's compute and memory requirements* into account
* It is integrated into PyTorch
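The interference-aware idea above can be sketched in a few lines: co-schedule a best-effort operator next to a high-priority one only when their dominant resources differ. This is a minimal Python model, not Orion's actual implementation; the class names, utilization fields, and example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Offline profile of one GPU kernel (hypothetical fields)."""
    name: str
    compute_util: float  # fraction of peak compute throughput
    memory_util: float   # fraction of peak memory throughput


def bottleneck(k: KernelProfile) -> str:
    """Classify a kernel by its dominant resource."""
    return "compute" if k.compute_util >= k.memory_util else "memory"


def may_colocate(hi: KernelProfile, be: KernelProfile) -> bool:
    """Admit a best-effort kernel alongside a high-priority one only
    when their bottleneck resources differ, so they contend less."""
    return bottleneck(hi) != bottleneck(be)


# Example profiles (made-up utilization numbers).
conv = KernelProfile("conv2d", compute_util=0.9, memory_util=0.3)
norm = KernelProfile("batchnorm", compute_util=0.2, memory_util=0.8)
gemm = KernelProfile("gemm", compute_util=0.85, memory_util=0.4)

print(may_colocate(conv, norm))  # True: compute-bound vs. memory-bound
print(may_colocate(conv, gemm))  # False: both compute-bound
```

The actual system makes this decision per kernel launch, using profiles collected ahead of time (see the technical details below).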

### Technical details

* Influences the hardware scheduler's behavior through CUDA stream priorities
* Uses CUDA events to *monitor the progress of each stream* on the GPU
* Schedules each `cudaMemcpy` operation by *considering its PCIe bandwidth requirement and the current bus bandwidth utilization*
* Uses [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) and [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) to collect *the compute throughput, memory throughput, and execution time of each kernel*
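The `cudaMemcpy` scheduling point can be illustrated with a toy model: estimate a copy's transfer time from its size and an assumed effective PCIe bandwidth, and defer best-effort copies while the measured bus utilization is above a cap. This is a sketch only; the 12 GB/s link bandwidth, the 0.8 utilization cap, and the `CopyScheduler` interface are assumptions, not the paper's API.

```python
from collections import deque


def pcie_time_ms(size_bytes: int, bw_gb_s: float = 12.0) -> float:
    """Estimated transfer time at an assumed effective PCIe bandwidth."""
    return size_bytes / (bw_gb_s * 1e9) * 1e3


class CopyScheduler:
    """Queues best-effort memcpy requests and releases one only while
    measured bus utilization is under a cap (threshold assumed)."""

    def __init__(self, link_bw_gb_s: float = 12.0, util_cap: float = 0.8):
        self.bw = link_bw_gb_s
        self.cap = util_cap
        self.pending = deque()

    def submit(self, size_bytes: int) -> None:
        self.pending.append(size_bytes)

    def next_copy(self, measured_util: float):
        """Return the size of the next copy to issue, or None to defer."""
        if not self.pending or measured_util >= self.cap:
            return None
        return self.pending.popleft()


sched = CopyScheduler()
sched.submit(120_000_000)                   # ~10 ms at 12 GB/s
print(sched.next_copy(measured_util=0.9))   # None: bus saturated, defer
print(sched.next_copy(measured_util=0.3))   # 120000000: bus idle, issue
```

In the real system, the utilization input would come from runtime measurement of the PCIe bus rather than being passed in as a parameter.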

### Evaluation

* Baselines
  * Temporal sharing — time-slice the GPU by executing one job’s request at a time
  * NVIDIA MPS
  * CUDA Streams
  * [REEF](https://www.usenix.org/conference/osdi22/presentation/han)
