Revisiting network support for RDMA

An improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses.

Meta Info

Presented at SIGCOMM 2018.

Authors: Radhika Mittal (UC Berkeley), Alexander Shpiner (Mellanox), Aurojit Panda (NYU), Eitan Zahavi (Mellanox), Arvind Krishnamurthy (UW), Sylvia Ratnasamy, Scott Shenker (UC Berkeley).

Understanding the paper

  • This paper proposes an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses.

    • It shows that PFC is not fundamentally required to support RoCE.

    • It shows that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios.

Background

  • Infiniband RDMA

    • Long used in the HPC community.

    • Uses credit-based flow control to make the network lossless.

    • Not designed to efficiently recover from packet losses, because packet drops are rare in such clusters.

    • Mechanism

      • When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender.

      • When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).

  • RoCE enables the use of RDMA over Ethernet (and, with RoCEv2, over IP-routed networks).

    • Adopts the same Infiniband transport design.

    • Uses PFC to make the network lossless.

  • Priority Flow Control (PFC)

    • Ethernet’s flow control mechanism.

    • A switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC) when its queue exceeds a configured threshold.

    • When the queue drains below this threshold, an X-ON frame is sent to resume transmission.

    • Limitations: PFC causes various performance issues (e.g., head-of-line blocking, congestion spreading, and occasional deadlocks) and makes the network harder to understand and manage.

  • iWARP vs RoCE

    • iWARP implements the entire TCP stack in hardware and must translate TCP's byte-stream semantics to RDMA segments.

    • iWARP is significantly more complex and expensive than RoCE, with inferior performance.
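
The Infiniband go-back-N mechanism described above can be sketched as a toy Python model. This is only an illustrative sketch, not actual NIC logic: the function name `go_back_n_transmissions`, the `lost_once` parameter, and the simplification that the sender finishes its current burst before reacting to a NACK are all assumptions made for these notes.

```python
def go_back_n_transmissions(num_packets, lost_once):
    """Count packets put on the wire to deliver num_packets with go-back-N.

    lost_once: sequence numbers dropped by the network on their first
    transmission only (hypothetical loss model for illustration).
    """
    lost = set(lost_once)
    expected = 0   # receiver side: next in-order sequence number
    sent = 0       # total packets transmitted, including retransmissions

    while expected < num_packets:
        # Sender (re)transmits everything after the last acknowledged packet.
        for seq in range(expected, num_packets):
            sent += 1
            if seq in lost:
                lost.discard(seq)   # dropped in the network this one time
                continue
            if seq == expected:
                expected += 1       # in order: receiver accepts it
            # Otherwise the packet is out of order: the receiver discards it
            # and NACKs, so the sender will go back to `expected`.
    return sent


if __name__ == "__main__":
    print(go_back_n_transmissions(10, {3}))
```

In this model, sending 10 packets with a single drop of packet 3 puts 17 packets on the wire: the original 10, plus packets 3 through 9 a second time, because everything received after the gap was discarded.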

Central Question

  • Does RDMA require a lossless network (which includes PFC)?

    • The answer is no.

Key Designs

  • Improve the loss recovery mechanism.

    • Selective retransmission (inspired by TCP’s loss recovery).

  • BDP-FC mechanism.

    • Basic end-to-end packet-level flow control, which bounds the number of in-flight packets by the bandwidth-delay product (BDP) of the network (as suggested in pFabric).
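
The two key designs above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the loss model (each listed packet is dropped once), and the example link speed, RTT, and MTU are all hypothetical values chosen for these notes.

```python
def bdp_cap_packets(link_gbps, rtt_us, mtu_bytes):
    """Number of MTU-sized packets that fit in one bandwidth-delay product.

    Uses integer arithmetic: Gbps * microseconds = thousands of bits.
    """
    bdp_bits = link_gbps * rtt_us * 1000
    bdp_bytes = bdp_bits // 8
    return -(-bdp_bytes // mtu_bytes)   # ceiling division


def may_send(in_flight, cap):
    """BDP-FC rule: send a new packet only while in-flight packets < cap."""
    return in_flight < cap


def selective_retransmit_transmissions(num_packets, lost_once):
    """Count packets put on the wire when only lost packets are resent.

    lost_once: sequence numbers dropped on their first transmission only
    (hypothetical loss model for illustration).
    """
    lost = set(lost_once)
    received = set()
    sent = 0

    # First pass: out-of-order arrivals are kept rather than discarded.
    for seq in range(num_packets):
        sent += 1
        if seq in lost:
            lost.discard(seq)
        else:
            received.add(seq)

    # Recovery: retransmit only the packets the receiver is missing.
    for seq in range(num_packets):
        if seq not in received:
            sent += 1
            received.add(seq)
    return sent


if __name__ == "__main__":
    cap = bdp_cap_packets(link_gbps=40, rtt_us=24, mtu_bytes=1000)
    print(cap, may_send(cap - 1, cap), may_send(cap, cap))
    print(selective_retransmit_transmissions(10, {3}))
```

With the assumed 40 Gbps link, 24 µs RTT, and 1000-byte MTU, the cap works out to 120 in-flight packets. For 10 packets with a single drop of packet 3, selective retransmission sends 11 packets in total, whereas go-back-N would resend every packet that followed the lost one.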
