Slashing the disaggregation tax in heterogeneous data centers with FractOS

#rCUDA #distributed_OS #disaggregated_system #GPU_adaptor #device_adaptor

Meta Info

Presented at EuroSys 2022.

Homepage:: https://lsds.doc.ic.ac.uk/projects/fractos

Understanding the paper

TL;DR

This paper presents FractOS, a distributed OS designed to minimize the network overheads of disaggregation in heterogeneous data centers.

It enables direct peer-to-peer (P2P) data transfers between devices, without centralized control by the application or the OS.
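To make the benefit of direct transfers concrete, here is a minimal sketch that counts bulk data transfers in a toy pipeline. The model is an assumption made for illustration and is not taken from the paper: an application sends input to the first of `stages` devices, each device feeds the next, and the final result returns to the application. The function names are hypothetical.

```cpp
#include <cstddef>

// Toy model (an assumption, not from the paper): count bulk data transfers
// over the network for a pipeline of `stages` devices.

// Centralized control: every intermediate result travels back to the
// application node, which then forwards it to the next device.
// app -> d1, (d_i -> app, app -> d_{i+1}) for each hop, d_n -> app.
std::size_t centralized_transfers(std::size_t stages) {
    return stages == 0 ? 0 : 2 * stages;
}

// Direct P2P: each device sends its output straight to the next device.
// app -> d1, d_i -> d_{i+1} for each hop, d_n -> app.
std::size_t direct_transfers(std::size_t stages) {
    return stages == 0 ? 0 : stages + 1;
}
```

Under this model a three-stage pipeline needs six transfers with centralized control and four with direct transfers, and the gap widens by one transfer for every additional stage.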

Existing problem

  • In a disaggregated system, current software stacks send unnecessary messages over the shared data-center network.

How to manage accelerators (GPUs)

  • Compared to rCUDA

    • rCUDA accesses remote GPUs transparently by interposing CUDA driver calls.

    • In contrast, the FractOS GPU service uses a single round-trip Request invocation per kernel invocation.

  • FractOS

    • Build a GPU adaptor to expose a disaggregated GPU.

    • The GPU adaptor runs on the host CPU, uses the OS GPU driver, and offers several RPCs exposed through Requests: GPU context initialization, memory allocation and deallocation, kernel loading, kernel invocation, and cleanup.
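The RPC surface listed above can be sketched as a small in-process mock. This is only an illustration under stated assumptions: every name here (`GpuOp`, `GpuRequest`, `GpuResponse`, `MockGpuAdaptor`) is hypothetical and is not the FractOS API, no real GPU driver is called, and each call to `handle` stands in for one round-trip Request.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical operation set mirroring the RPCs named in the notes.
enum class GpuOp { InitContext, AllocMemory, FreeMemory, LoadKernel, InvokeKernel, Cleanup };

struct GpuRequest {
    GpuOp op;
    std::uint64_t handle;     // buffer or kernel handle, when relevant
    std::size_t bytes;        // allocation size for AllocMemory
    std::string kernel_name;  // kernel to load for LoadKernel
};

struct GpuResponse {
    bool ok;
    std::uint64_t handle;     // new handle for AllocMemory / LoadKernel
};

// In-process stand-in for an adaptor; it only tracks bookkeeping state.
class MockGpuAdaptor {
public:
    // Each call models one round-trip Request to the adaptor.
    GpuResponse handle(const GpuRequest& req) {
        ++round_trips_;
        switch (req.op) {
        case GpuOp::InitContext:
            initialized_ = true;
            return {true, 0};
        case GpuOp::AllocMemory:
            if (!initialized_) return {false, 0};
            buffers_[next_handle_] = req.bytes;
            return {true, next_handle_++};
        case GpuOp::FreeMemory:
            return {buffers_.erase(req.handle) == 1, 0};
        case GpuOp::LoadKernel:
            if (!initialized_) return {false, 0};
            kernels_[next_handle_] = req.kernel_name;
            return {true, next_handle_++};
        case GpuOp::InvokeKernel:
            // A single request per kernel invocation.
            if (kernels_.count(req.handle) == 0) return {false, 0};
            ++launches_;
            return {true, 0};
        case GpuOp::Cleanup:
            buffers_.clear();
            kernels_.clear();
            initialized_ = false;
            return {true, 0};
        }
        return {false, 0};
    }

    std::size_t round_trips() const { return round_trips_; }
    std::size_t launches() const { return launches_; }

private:
    bool initialized_ = false;
    std::uint64_t next_handle_ = 1;
    std::size_t round_trips_ = 0;
    std::size_t launches_ = 0;
    std::map<std::uint64_t, std::size_t> buffers_;
    std::map<std::uint64_t, std::string> kernels_;
};
```

A client would issue `InitContext`, `AllocMemory`, and `LoadKernel` once, then one `InvokeKernel` request per launch, so `round_trips()` grows by exactly one for each kernel invocation. A real adaptor would forward each case to the GPU driver instead of updating maps.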

Implementation

  • 17.5K LoC of C++.
