# Understanding RDMA microarchitecture resources for performance isolation

## Meta Info

Presented at [NSDI 2023](https://www.usenix.org/conference/nsdi23/presentation/kong).

Authors: Xinhao Kong, Jingrong Chen (*Duke University*), Wei Bai (*Microsoft*), Yechen Xu (*SJTU*), Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye (*Microsoft*), Alvin R. Lebeck, Danyang Zhuo (*Duke University*).

Benchmark suite: <https://github.com/host-bench/husky>

## Understanding the paper

### Background

* Cloud providers are working towards supporting RDMA in *general-purpose guest VMs* to benefit third-party workloads (as opposed to first-party workloads such as storage and ML). Thus they must provide *performance isolation* for tenants sharing the same RNIC.
* RDMA brings unique challenges to network performance isolation due to its complex NIC microarchitecture resources (e.g., NIC cache and processing units).
* This work looks at *how these microarchitecture resources affect RDMA performance isolation* from *a public cloud provider’s perspective*. The cloud provider has *no knowledge or control of tenants’ RDMA applications*, and tenants can consume RNIC microarchitecture resources in arbitrary ways.

### Contributions

* Study the impact of all types of control verbs and exceptions on RDMA microarchitecture resource consumption.
* Present a model that represents how RDMA operations use RNIC resources.
* Develop a *test suite* to evaluate RDMA performance isolation solutions. No existing solution passes the test suite, including RNICs from three major vendors: NVIDIA (which acknowledged the results), Chelsio, and Intel.

### Key findings

* Microarchitecture resources
  * NIC caches
    * *Control verbs* can cause excessive cache misses and a drastic performance reduction.
    * *Data verbs* contend for different RNIC caches.
    * Accessing a wide range of objects (QPs, CQs, MRs) causes ICM cache misses.
  * Processing units
    * Performance interference between different *data verbs* depends on the complexity of verbs.
    * *Error handling* can stall RNIC processing units and hang all the applications.
    * The impact of *control verbs* is limited by their kernel involvement.
  * PCIe bandwidth
    * Becomes the bottleneck only when the request size falls within a specific range.
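
The PCIe finding can be illustrated with a toy bottleneck model. This is a minimal sketch and not the model from the paper. It assumes that small requests are limited by the processing units' message rate, large requests by the network link, and that PCIe limits the range in between because each request carries a fixed per-request PCIe overhead. The function name `bottleneck` and every number (message rate, bandwidths, overhead bytes) are hypothetical values chosen only to make the three regimes visible.

```python
def bottleneck(size_bytes, msg_rate=100e6, pcie_bw=16e9,
               link_bw=12.5e9, pcie_overhead=100):
    """Return (limiting resource, payload throughput in bytes/s).

    All defaults are hypothetical, not measurements of any real RNIC:
      msg_rate      - requests/s the processing units can sustain
      pcie_bw       - raw PCIe bandwidth in bytes/s
      link_bw       - network link bandwidth in bytes/s
      pcie_overhead - extra PCIe bytes per request (descriptors, headers)
    """
    limits = {
        # Processing units cap the number of requests per second.
        "processing units": msg_rate * size_bytes,
        # PCIe moves payload plus a fixed per-request overhead.
        "PCIe": pcie_bw * size_bytes / (size_bytes + pcie_overhead),
        # The link caps payload bytes per second regardless of size.
        "network link": link_bw,
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    for size in (32, 128, 4096):
        name, rate = bottleneck(size)
        print(f"{size:>5} B: limited by {name} ({rate / 1e9:.2f} GB/s)")
```

With these assumed parameters, PCIe is the limiting resource only for request sizes between roughly 60 and 357 bytes. Other parameter choices shift or remove that window.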
