# Revisiting network support for RDMA

## Meta Info

Presented at [SIGCOMM 2018](https://doi.org/10.1145/3230543.3230557).

Authors: Radhika Mittal (*UC Berkeley*), Alexander Shpiner (*Mellanox*), Aurojit Panda (*NYU*), Eitan Zahavi (*Mellanox*), Arvind Krishnamurthy (*UW*), Sylvia Ratnasamy, Scott Shenker (*UC Berkeley*).

## Understanding the paper

* This paper proposes an **improved RoCE NIC (IRN)** design that *makes a few simple changes to the RoCE NIC* for *better handling of packets*.
  * It shows that PFC is not *fundamentally required* to support RoCE.
  * It shows that *IRN (without PFC)* outperforms *RoCE (with PFC)* by 6-83% for typical network scenarios.

### Background

* Infiniband RDMA
  * Long used in the HPC community.
  * Uses *credit-based flow control* to make the network *lossless*.
  * Not designed to efficiently recover from packet losses, because *packet drops are rare* in such clusters.
  * Mechanism
    * When the receiver receives an out-of-order packet, it simply *discards* it and *sends a negative acknowledgement (NACK)* to the sender.
    * When the sender sees a NACK, it *retransmits all packets* that were sent after the last acknowledged packet (i.e., it performs *a go-back-N retransmission*).
* RoCE (RDMA over Converged Ethernet) enables the use of RDMA over Ethernet (and, with RoCEv2, over IP-routed networks).
  * Adopts *the same Infiniband transport design*.
  * Uses PFC to make the network lossless.
* Priority Flow Control (PFC)
  * Ethernet’s flow control mechanism.
  * A switch *sends a pause (or X-OFF) frame* to the upstream entity (a switch or a NIC), when the queue exceeds a certain configured threshold.
  * When the queue drains below this threshold, *an X-ON frame is sent to resume transmission*.
  * Limitations: causes performance issues such as head-of-line blocking, congestion spreading, and occasional deadlocks; makes the network harder to understand and manage.
* iWARP vs RoCE
  * **iWARP** implements the entire TCP stack in hardware; it needs to translate TCP's byte stream semantics to RDMA segments.
  * **iWARP** is *significantly more complex and expensive* than **RoCE**, with *inferior performance*.
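
The go-back-N and PFC pause/resume mechanisms described above can be sketched as a toy model. This is a minimal illustration under simplifying assumptions, not the behaviour of any real NIC or switch: the names `go_back_n` and `PfcQueue`, the burst-based timing (the sender emits up to `in_flight` packets before it reacts to a NACK or timeout), and the single shared threshold are all hypothetical.

```python
def go_back_n(num_packets, drop_first_attempt, in_flight):
    """Return every sequence number put on the wire, in order.

    Assumptions (hypothetical model): packets in `drop_first_attempt`
    are lost only on their first transmission, and the sender emits up
    to `in_flight` packets before reacting to a NACK or timeout.
    """
    wire = []
    dropped = set(drop_first_attempt)
    expected = 0  # receiver: next in-order sequence number
    next_seq = 0  # sender: where the next burst starts
    while expected < num_packets:
        for seq in range(next_seq, min(next_seq + in_flight, num_packets)):
            wire.append(seq)
            if seq in dropped:
                dropped.discard(seq)  # lost in the network
                continue
            if seq == expected:
                expected += 1  # in-order packet is accepted
            # otherwise: out-of-order packet is discarded and NACKed
        next_seq = expected  # go back to the first unacknowledged packet
    return wire


class PfcQueue:
    """Toy switch queue that emits PFC pause/resume signals."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.depth = 0
        self.paused = False  # is the upstream entity currently paused?

    def enqueue(self):
        self.depth += 1
        if not self.paused and self.depth > self.threshold:
            self.paused = True
            return "X-OFF"  # pause frame to the upstream switch or NIC
        return None

    def dequeue(self):
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth < self.threshold:
            self.paused = False
            return "X-ON"  # resume frame
        return None


if __name__ == "__main__":
    # One loss (packet 1) forces packets 2 and 3 to be sent twice.
    print(go_back_n(num_packets=6, drop_first_attempt={1}, in_flight=4))
```

In this model a single lost packet costs extra transmissions for every packet that was already in flight behind it, which is the inefficiency that matters once the network is allowed to drop packets.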

### Central Question

* Does RDMA require a lossless network (and therefore PFC)?
  * The answer is no.

### Key Designs

* Improve the loss recovery mechanism.
  * *Selective retransmission* (inspired by TCP’s loss recovery).
* BDP-FC (BDP-based flow control) mechanism.
  * *Basic end-to-end packet-level flow control*, which *bounds the number of in-flight packets* by the bandwidth-delay product (BDP) of the network (as suggested in pFabric).
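
As a rough illustration of the two designs, the sketch below models selective retransmission and computes a BDP-based cap on in-flight packets. It is a simplified model under stated assumptions, not IRN's actual implementation: the function names, the burst-based timing, and the example link parameters (40 Gbps, 24 µs round-trip time, 1000-byte MTU) are hypothetical values chosen only for the arithmetic.

```python
def selective_retransmit(num_packets, drop_first_attempt, in_flight):
    """Return every sequence number put on the wire, in order.

    Assumptions (hypothetical model): packets in `drop_first_attempt`
    are lost only on their first transmission, the receiver keeps
    out-of-order packets, and the sender emits up to `in_flight`
    packets before it learns which ones were lost.
    """
    wire = []
    dropped = set(drop_first_attempt)
    pending = list(range(num_packets))
    while pending:
        burst, pending = pending[:in_flight], pending[in_flight:]
        lost = []
        for seq in burst:
            wire.append(seq)
            if seq in dropped:
                dropped.discard(seq)
                lost.append(seq)  # only this packet needs resending
        pending = lost + pending  # retransmit losses before new data
    return wire


def bdp_cap_packets(bandwidth_bps, rtt_ns, mtu_bytes):
    """BDP of the path in MTU-sized packets, rounded up.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    bdp_bytes = bandwidth_bps * rtt_ns // (8 * 10**9)
    return -(-bdp_bytes // mtu_bytes)  # ceiling division


def may_send(packets_in_flight, cap):
    """BDP-FC style check: send only while below the cap."""
    return packets_in_flight < cap


if __name__ == "__main__":
    print(selective_retransmit(num_packets=6, drop_first_attempt={1}, in_flight=4))
    cap = bdp_cap_packets(40_000_000_000, 24_000, 1000)
    print(cap, may_send(cap - 1, cap), may_send(cap, cap))
```

With the same single loss, only the missing packet is resent, and the cap keeps the sender from pushing more packets into the network than the path can hold.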
