# Revisiting network support for RDMA

## Meta Info

Presented in [SIGCOMM 2018](https://doi.org/10.1145/3230543.3230557).

Authors: Radhika Mittal (*UC Berkeley*), Alexander Shpiner (*Mellanox*), Aurojit Panda (*NYU*), Eitan Zahavi (*Mellanox*), Arvind Krishnamurthy (*UW*), Sylvia Ratnasamy, Scott Shenker (*UC Berkeley*).

## Understanding the paper

* This paper proposes an **improved RoCE NIC (IRN)** design that *makes a few simple changes to the RoCE NIC* for *better handling of packets*.
  * It shows that PFC is not *fundamentally required* to support RoCE.
  * It shows that *IRN (without PFC)* outperforms *RoCE (with PFC)* by 6-83% for typical network scenarios.

### Background

* Infiniband RDMA
  * Long used in the HPC community.
  * Uses *credit-based flow control* to make the network *lossless*.
  * Not designed to efficiently recover from packet losses, because *packet drops are rare* in such clusters.
  * Mechanism
    * When the receiver receives an out-of-order packet, it simply *discards* it and *sends a negative acknowledgement (NACK)* to the sender.
    * When the sender sees a NACK, it *retransmits all packets* that were sent after the last acknowledged packet (i.e., it performs *a go-back-N retransmission*).
* RoCE (RDMA over Converged Ethernet) enables the use of RDMA over Ethernet; RoCEv2 extends this to IP-routed networks.
  * Adopts *the same Infiniband transport design*.
  * Uses PFC to make the network lossless.
* Priority Flow Control (PFC)
  * Ethernet’s flow control mechanism.
  * A switch *sends a pause (or X-OFF) frame* to the upstream entity (a switch or a NIC), when the queue exceeds a certain configured threshold.
  * When the queue drains below this threshold, *an X-ON frame is sent to resume transmission*.
  * Limitations: PFC causes various performance issues (e.g., head-of-line blocking, congestion spreading, and occasional deadlocks) and makes the network harder to understand and manage.
* iWARP vs RoCE
  * **iWARP** implements the entire TCP stack in hardware and must translate TCP's byte-stream semantics to RDMA segments.
  * **iWARP** is *significantly more complex and expensive* than **RoCE**, with *inferior performance*.
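
The go-back-N mechanism described above can be illustrated with a minimal sketch. It is not the RoCE NIC implementation: the function name, the loss model (a packet is dropped only on its first transmission), and the assumption that the sender transmits its whole remaining window before reacting to a NACK are all hypothetical simplifications.

```python
def go_back_n_transmissions(num_packets, lost_first_attempt):
    """Count total packet transmissions under go-back-N.

    lost_first_attempt: set of sequence numbers dropped on their first
    transmission (hypothetical loss model; retransmissions always arrive).
    """
    expected = 0          # next in-order sequence number the receiver accepts
    sent = 0              # total transmissions, including retransmissions
    dropped_once = set()  # packets that have already been dropped once

    while expected < num_packets:
        # The sender (re)transmits everything after the last acknowledged packet.
        for seq in range(expected, num_packets):
            sent += 1
            if seq in lost_first_attempt and seq not in dropped_once:
                dropped_once.add(seq)  # lost in the network
                continue
            if seq == expected:
                expected += 1          # in-order packet: accepted
            # Otherwise the packet is out of order: the receiver discards it
            # and NACKs, so the sender will go back to `expected`.
    return sent
```

With five packets and only packet 1 lost, this sketch counts 9 transmissions: packets 2 to 4 arrive but are discarded as out of order, so packets 1 to 4 are all sent again.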

### Central Question

* Does RDMA require a lossless network (and therefore PFC)?
  * The answer is no.

### Key Designs

* Improve the loss recovery mechanism.
  * *Selective retransmission* (inspired by TCP’s loss recovery).
* BDP-FC mechanism.
  * *Basic end-to-end packet level flow control*, which *bounds the number of in-flight packets* by the bandwidth-delay product (BDP) of the network (as suggested in pFabric).
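
The two designs above can be sketched as follows. This is an illustration under stated assumptions, not IRN's hardware logic: the function names, the loss model (a packet is dropped only on its first transmission), the ceiling rounding of the cap, and the link parameters in the usage note are hypothetical. In-flight packets are counted as the difference between the next sequence number and the last acknowledged one.

```python
def selective_retransmit(num_packets, lost_first_attempt):
    """Return the order of transmissions under selective retransmission.

    The receiver keeps out-of-order packets, so only the packets in
    lost_first_attempt (hypothetical loss model) are sent a second time.
    """
    received = set()
    transmissions = []
    for seq in range(num_packets):
        transmissions.append(seq)
        if seq not in lost_first_attempt:
            received.add(seq)  # kept even if it arrives out of order
    # Retransmit only what the receiver is still missing.
    for seq in range(num_packets):
        if seq not in received:
            transmissions.append(seq)
            received.add(seq)
    return transmissions


def bdp_cap_packets(bandwidth_bps, rtt_us, mtu_bytes):
    """Bandwidth-delay product expressed in MTU-sized packets (rounded up)."""
    bdp_bytes = bandwidth_bps * rtt_us // (8 * 1_000_000)
    return -(-bdp_bytes // mtu_bytes)  # integer ceiling division


def may_send(next_seq, last_acked_seq, cap):
    """BDP-FC check: send a new packet only while in-flight packets < cap."""
    return next_seq - last_acked_seq < cap
```

For example, `selective_retransmit(5, {1})` yields `[0, 1, 2, 3, 4, 1]`, i.e. 6 transmissions, with only the lost packet resent. With illustrative values of a 40 Gbps link, a 24 µs round-trip time, and a 1000-byte MTU, `bdp_cap_packets(40_000_000_000, 24, 1000)` gives a cap of 120 packets.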

