# Remote Direct Memory Access (RDMA)

## General RDMA

* X-RDMA: Effective RDMA Middleware in Large-scale Production Environments ([CLUSTER 2019](https://paper.lingyunyang.com/reading-notes/conference/cluster-2019)) \[[Paper](https://ieeexplore.ieee.org/document/8891004)]
  * Alibaba
  * Focus on robustness, scalability, and maintainability.
* FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds ([NSDI 2019](https://paper.lingyunyang.com/reading-notes/conference/nsdi-2019)) \[[Paper](https://www.usenix.org/conference/nsdi19/presentation/kim)] \[[Code](https://github.com/Microsoft/Freeflow)]
  * CMU & Microsoft & Alibaba & ByteDance
  * A software-based RDMA virtualization framework designed for containerized clouds.
* Revisiting Network Support for RDMA ([SIGCOMM 2018](https://paper.lingyunyang.com/reading-notes/conference/sigcomm-2018)) \[[Personal Notes](https://paper.lingyunyang.com/reading-notes/conference/sigcomm-2018/irn)] \[[Paper](https://dl.acm.org/doi/10.1145/3230543.3230557)]
  * UC Berkeley & ICSI & Mellanox & NYU & UW
  * IRN (Improved RoCE NIC): handles packet loss more efficiently, eliminating the need for PFC.
* RDMA over Commodity Ethernet at Scale (SIGCOMM 2016) \[[Paper](https://dl.acm.org/doi/10.1145/2934872.2934908)]
  * Microsoft
  * Challenges of deploying *RoCEv2* at scale; proposes a DSCP (Differentiated Services Code Point) based PFC mechanism.
* Congestion Control for Large-Scale RDMA Deployments (SIGCOMM 2015) \[[Paper](https://dl.acm.org/doi/10.1145/2785956.2787484)]
  * Microsoft & Mellanox & UCSB
  * DCQCN: A congestion control scheme for *RoCEv2* that alleviates the problems caused by PFC.

## RDMA for Deep Learning

* Fast Distributed Deep Learning over RDMA (EuroSys 2019) \[[Paper](https://dl.acm.org/doi/10.1145/3302424.3303975)]
  * MSRA
* Towards Zero Copy Dataflows using RDMA (SIGCOMM 2017 Posters and Demos) \[[Paper](https://dl.acm.org/doi/10.1145/3123878.3131975)] \[[Code](https://github.com/tensorflow/networking/tree/master/tensorflow_networking/gdr)]
  * HKUST
  * Merged into TensorFlow.

## RDMA for Storage

* Empowering Azure Storage with RDMA ([NSDI 2023](https://paper.lingyunyang.com/reading-notes/conference/nsdi-2023)) \[[Paper](https://www.usenix.org/conference/nsdi23/presentation/bai)]
  * Microsoft
  * Production experience in Microsoft Azure.
  * Around **70%** of traffic in Azure is RDMA.
* When Cloud Storage Meets RDMA ([NSDI 2021](https://paper.lingyunyang.com/reading-notes/conference/nsdi-2021)) \[[Paper](https://www.usenix.org/conference/nsdi21/presentation/gao)]
  * NJU & Alibaba
  * Pangu: Alibaba's distributed storage system.
  * Production experience in Alibaba Cloud.
  * Two workarounds to handle PFC storms: *shutdown, RDMA/TCP switching*.

## Performance Isolation

* Understanding RDMA Microarchitecture Resources for Performance Isolation ([NSDI 2023](https://paper.lingyunyang.com/reading-notes/conference/nsdi-2023)) \[[Personal Notes](https://paper.lingyunyang.com/reading-notes/conference/nsdi-2023/husky)] \[[Paper](https://www.usenix.org/conference/nsdi23/presentation/kong)] \[[Benchmark Suite](https://github.com/host-bench/husky)]
  * Duke & Microsoft & SJTU
  * Develop a *test suite* to *evaluate* RDMA performance isolation solutions.

## Acronyms

* PFC: Priority Flow Control
* RoCE: RDMA over Converged Ethernet
* IBoE: InfiniBand over Ethernet
