Slashing the disaggregation tax in heterogeneous data centers with FractOS
#rCUDA #distributed_OS #disaggregated_system #GPU_adaptor #device_adaptor
Meta Info
Presented at EuroSys 2022.
Homepage:: https://lsds.doc.ic.ac.uk/projects/fractos
Understanding the paper
TL;DR
This paper presents FractOS, a distributed OS designed to minimize the network overhead of disaggregation in heterogeneous data centers.
It enables direct P2P data transfers between different devices, without centralized application and OS control.
Existing problem
In a disaggregated system, current software stacks send unnecessary messages over the shared data-center network.
How to manage accelerators (GPUs)
Compared to rCUDA
rCUDA accesses remote GPUs transparently by interposing CUDA driver calls.
In contrast, the FractOS GPU service uses a single round-trip Request invocation per kernel invocation.
FractOS
Build a GPU adaptor to expose a disaggregated GPU.
The GPU adaptor runs on the host CPU and uses the OS GPU driver. It offers several RPCs, exposed through Requests: GPU context initialization, memory allocation and deallocation, kernel loading, kernel invocation, and cleanup.
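The operations listed above can be pictured as a small request dispatcher. The sketch below is an in-process mock for illustration only: it uses no network and no real GPU, and every name in it (`GpuOp`, `GpuRequest`, `GpuResponse`, `MockGpuAdaptor`, the `scale2` kernel) is hypothetical rather than taken from FractOS. A "kernel" here is just a CPU function applied to a host-side buffer. What it tries to show is that one kernel invocation maps to exactly one served request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical request kinds mirroring the adaptor operations in the notes.
enum class GpuOp { InitContext, Alloc, Free, LoadKernel, InvokeKernel, Cleanup };

// Hypothetical request and response shapes; these are NOT FractOS types.
struct GpuRequest {
    GpuOp op = GpuOp::InitContext;
    std::uint64_t handle = 0;         // buffer handle (Free) or kernel handle (InvokeKernel)
    std::size_t size = 0;             // element count for Alloc
    std::string kernel_name;          // kernel to load for LoadKernel
    std::vector<std::uint64_t> args;  // buffer handles passed to InvokeKernel
};

struct GpuResponse {
    bool ok;
    std::uint64_t handle;  // newly created handle, or 0 when none applies
};

// In-process stand-in for a GPU adaptor: buffers live in host memory and
// kernels are plain CPU functions.
class MockGpuAdaptor {
public:
    GpuResponse serve(const GpuRequest& req) {
        ++requests_served_;
        switch (req.op) {
        case GpuOp::InitContext:
            initialized_ = true;
            return reply(true);
        case GpuOp::Alloc: {
            if (!initialized_) return reply(false);
            const std::uint64_t h = next_handle_++;
            buffers_[h] = std::vector<float>(req.size, 1.0f);  // filled with 1.0
            return reply(true, h);
        }
        case GpuOp::Free:
            return reply(buffers_.erase(req.handle) == 1);
        case GpuOp::LoadKernel: {
            // Only one built-in kernel exists in this mock.
            if (!initialized_ || req.kernel_name != "scale2") return reply(false);
            const std::uint64_t h = next_handle_++;
            kernels_[h] = [](std::vector<float>& buf) {
                for (float& x : buf) x *= 2.0f;
            };
            return reply(true, h);
        }
        case GpuOp::InvokeKernel: {
            auto k = kernels_.find(req.handle);
            if (k == kernels_.end() || req.args.size() != 1) return reply(false);
            auto b = buffers_.find(req.args[0]);
            if (b == buffers_.end()) return reply(false);
            k->second(b->second);  // the whole launch is handled in this one request
            return reply(true);
        }
        case GpuOp::Cleanup:
            buffers_.clear();
            kernels_.clear();
            initialized_ = false;
            return reply(true);
        }
        return reply(false);
    }

    // Inspection helpers for the example; throws std::out_of_range on a bad handle.
    std::vector<float> read(std::uint64_t h) const { return buffers_.at(h); }
    std::size_t requests_served() const { return requests_served_; }

private:
    static GpuResponse reply(bool ok, std::uint64_t h = 0) { return GpuResponse{ok, h}; }

    bool initialized_ = false;
    std::uint64_t next_handle_ = 1;
    std::size_t requests_served_ = 0;
    std::map<std::uint64_t, std::vector<float>> buffers_;
    std::map<std::uint64_t, std::function<void(std::vector<float>&)>> kernels_;
};
```

A caller would send `InitContext`, `Alloc`, and `LoadKernel` once, then one `InvokeKernel` request per launch, passing the returned kernel handle in `handle` and the buffer handle in `args`. In a real deployment each `serve` call would correspond to a message over the network, which is why keeping kernel invocation to a single request matters.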
Implementation
17.5K LoC of C++.