Revisiting network support for RDMA
An improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses.
Meta Info
Presented in SIGCOMM 2018.
Authors: Radhika Mittal (UC Berkeley), Alexander Shpiner (Mellanox), Aurojit Panda (NYU), Eitan Zahavi (Mellanox), Arvind Krishnamurthy (UW), Sylvia Ratnasamy, Scott Shenker (UC Berkeley).
Understanding the paper
This paper proposes an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses.
It shows that PFC is not fundamentally required to support RoCE.
It shows that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios.
Background
Infiniband RDMA
Long used in the HPC community.
Uses credit-based flow control to make the network lossless.
Not designed to efficiently recover from packet losses, because packet drops are rare in such clusters.
Mechanism
When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender.
When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).
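The mechanism above can be sketched in a few lines. This is only an illustrative model, not NIC code: the function names, the `("ACK", psn)` / `("NACK", psn)` tuples, and the use of plain integers as packet sequence numbers (PSNs) are all assumptions made for the example, and duplicate packets are not treated specially.

```python
def receiver_step(expected_psn, psn):
    """Process one arriving packet; return (new_expected_psn, response).

    Hypothetical model: the receiver accepts only the next in-order PSN.
    """
    if psn == expected_psn:
        # In-order packet: accept it and acknowledge.
        return expected_psn + 1, ("ACK", psn)
    # Out-of-order packet: discard it and NACK the PSN still expected.
    return expected_psn, ("NACK", expected_psn)


def go_back_n_retransmit(last_acked_psn, next_psn_to_send):
    """Return every PSN sent after the last acknowledged one.

    On a NACK the sender resends all of them, including packets
    that actually arrived but were discarded by the receiver.
    """
    return list(range(last_acked_psn + 1, next_psn_to_send))


# Sender has sent PSNs 0..5 and packet 2 is lost in the network.
expected = 0
for arriving in (0, 1, 3):
    expected, response = receiver_step(expected, arriving)
print(response)                    # response to the out-of-order packet 3
print(go_back_n_retransmit(1, 6))  # PSNs resent after the NACK
```

In this run a single lost packet (PSN 2) causes four packets to be resent, which is the redundancy that makes go-back-N expensive once losses stop being rare.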
RoCE enables the use of RDMA over Ethernet (and, with RoCEv2, over IP-routed networks).
Adopts the same Infiniband transport design.
Uses PFC to make the network lossless.
Priority Flow Control (PFC)
Ethernet’s flow control mechanism.
A switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC), when the queue exceeds a certain configured threshold.
When the queue drains below this threshold, an X-ON frame is sent to resume transmission.
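A minimal sketch of the pause/resume behaviour described above, assuming a single queue with one threshold measured in packets. The class name `PfcQueue` and its methods are hypothetical and do not model real switch buffers, per-priority queues, or pause-frame timing.

```python
class PfcQueue:
    """Toy model of one switch queue that emits X-OFF / X-ON signals."""

    def __init__(self, threshold):
        self.threshold = threshold  # configured pause threshold (packets)
        self.depth = 0              # current queue occupancy (packets)
        self.paused = False         # whether the upstream entity is paused

    def enqueue(self, n=1):
        self.depth += n
        return self._signal()

    def dequeue(self, n=1):
        self.depth = max(0, self.depth - n)
        return self._signal()

    def _signal(self):
        # Pause upstream once the queue exceeds the threshold.
        if not self.paused and self.depth > self.threshold:
            self.paused = True
            return "X-OFF"
        # Resume upstream once the queue drains below the threshold.
        if self.paused and self.depth < self.threshold:
            self.paused = False
            return "X-ON"
        return None


q = PfcQueue(threshold=3)
signals = [q.enqueue(3), q.enqueue(1), q.dequeue(1), q.dequeue(1)]
print(signals)
```

While the upstream entity is paused it cannot send any traffic of that priority through the link, including flows that are not contributing to the congested queue.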
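Limitations: PFC causes various performance issues (such as head-of-line blocking, congestion spreading, and occasional deadlocks) and makes the network harder to understand and manage.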
iWARP vs RoCE
iWARP implements the entire TCP stack in hardware and must translate TCP's byte-stream semantics to RDMA segments.
iWARP is significantly more complex and expensive than RoCE, with inferior performance.
Central Question
Does RDMA require a lossless network (which, for RoCE, means enabling PFC)?
The answer is no.
Key Designs
Improve the loss recovery mechanism.
Selective retransmission (inspired by TCP’s loss recovery).
BDP-FC mechanism.
Basic end-to-end packet-level flow control, which bounds the number of in-flight packets by the bandwidth-delay product (BDP) of the network (as suggested in pFabric).
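As a rough illustration of BDP-FC, the sketch below converts a bandwidth-delay product into a packet cap and checks it before each send. The function names, the integer units (bits per second, microseconds, bytes), and the example numbers are assumptions chosen for the illustration, not values taken from these notes.

```python
def bdp_cap_packets(bandwidth_bps, rtt_us, mtu_bytes):
    """Return the BDP expressed in MTU-sized packets, rounded up.

    Integer arithmetic is used to avoid floating-point rounding.
    """
    bdp_bytes = bandwidth_bps * rtt_us // (8 * 1_000_000)
    # Ceiling division; always allow at least one packet in flight.
    return max(1, -(-bdp_bytes // mtu_bytes))


def can_send(in_flight, cap):
    """A sender may transmit a new packet only while in-flight < cap."""
    return in_flight < cap


# Hypothetical link: 40 Gbps, 24 microsecond round trip, 1000-byte MTU.
cap = bdp_cap_packets(40_000_000_000, 24, 1000)
print(cap)                 # BDP in packets for this example link
print(can_send(119, cap))  # still below the cap
print(can_send(120, cap))  # cap reached: wait for acknowledgements
```

Here `in_flight` stands for packets sent but not yet acknowledged. Bounding it by the BDP limits how many packets a single flow can have outstanding in the network, and therefore how much queueing it can add.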