# Understanding RDMA microarchitecture resources for performance isolation

## Meta Info

Presented at [NSDI 2023](https://www.usenix.org/conference/nsdi23/presentation/kong).

Authors: Xinhao Kong, Jingrong Chen (*Duke University*), Wei Bai (*Microsoft*), Yechen Xu (*SJTU*), Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye (*Microsoft*), Alvin R. Lebeck, Danyang Zhuo (*Duke University*).

Benchmark suite: <https://github.com/host-bench/husky>

## Understanding the paper

### Background

* Cloud providers are working towards supporting RDMA in *general-purpose guest VMs* to benefit third-party workloads (as opposed to first-party workloads such as storage and ML). Thus they must provide *performance isolation* for tenants sharing the same RNIC.
* RDMA brings unique challenges to network performance isolation due to its complex NIC microarchitecture resources (e.g., NIC cache and processing units).
* This work looks at *how these microarchitecture resources affect RDMA performance isolation* from *a public cloud provider’s perspective*. The cloud provider has *no knowledge of or control over tenants’ RDMA applications*, and tenants can consume RNIC microarchitecture resources in arbitrary ways.

### Contributions

* Study the impact of all types of control verbs and exceptions on RDMA microarchitecture resource consumption.
* Present a model that represents how RDMA operations use RNIC resources.
* Develop a *test suite* to evaluate RDMA performance isolation solutions. No existing solution passes the suite, including RNICs from three major vendors: NVIDIA (which acknowledged the results), Chelsio, and Intel.

### Key findings

* Microarchitecture resources
  * NIC caches
    * *Control verbs* can cause excessive cache misses and a drastic performance reduction.
    * *Data verbs* contend for different RNIC caches.
    * Accessing a wide range of objects (QPs, CQs, MRs) causes ICM cache misses.
  * Processing units
    * Performance interference between different *data verbs* depends on the complexity of verbs.
    * *Error handling* can stall RNIC processing units and hang all the applications.
    * The impact of *control verbs* is limited by their kernel involvement.
  * PCIe bandwidth
    * Becomes the bottleneck only when the request size falls within a specific range.
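
The cache finding above (touching many objects causes cache misses) can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the LRU replacement policy, the round-robin access pattern, the cache size of 64 entries, and the function name `miss_rate` are all hypothetical choices made for illustration. They are not taken from the paper and do not describe how any real RNIC manages its ICM cache.

```python
from collections import OrderedDict


def miss_rate(num_objects, cache_entries, accesses=10_000):
    """Fraction of misses when cycling over num_objects with an LRU cache.

    Toy model only: real RNIC caches are not known to use plain LRU.
    """
    cache = OrderedDict()  # keys are object ids (think QP/CQ/MR contexts)
    misses = 0
    for i in range(accesses):
        key = i % num_objects  # assumed round-robin access pattern
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict least recently used
    return misses / accesses


# A tenant touching few objects stays in cache; one touching many thrashes it.
narrow = miss_rate(num_objects=8, cache_entries=64)
wide = miss_rate(num_objects=1024, cache_entries=64)
print(f"narrow working set: {narrow:.4f}, wide working set: {wide:.4f}")
```

In this toy model the narrow working set only pays compulsory misses, while the wide one misses on every access because each object is evicted before it is reused. On a shared cache, that thrashing would also evict other tenants' entries, which is the isolation concern the notes describe.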

