Understanding RDMA microarchitecture resources for performance isolation

#RDMA #performance_isolation #test_suite #RNIC #virtual_machine #RDMA_microarchitecture_resource

Meta Info

Presented at NSDI 2023.

Authors: Xinhao Kong, Jingrong Chen (Duke University), Wei Bai (Microsoft), Yechen Xu (SJTU), Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye (Microsoft), Alvin R. Lebeck, Danyang Zhuo (Duke University).

Benchmark suite: https://github.com/host-bench/husky

Understanding the paper

Background

Cloud providers are working towards supporting RDMA in general-purpose guest VMs to benefit third-party workloads (as opposed to first-party workloads such as storage and ML). Thus they must provide performance isolation for tenants sharing the same RNIC.
RDMA brings unique challenges to network performance isolation due to its complex NIC microarchitecture resources (e.g., NIC cache and processing units).
This work looks at how these microarchitecture resources affect RDMA performance isolation from a public cloud provider’s perspective. The cloud provider has no knowledge of, or control over, tenants’ RDMA applications, and tenants can consume RNIC microarchitecture resources in arbitrary ways.

Contributions

Study the impact of all types of control verbs and exceptions on RDMA microarchitecture resource consumption.
Present a model that represents how RDMA operations use RNIC resources.
Develop a test suite to evaluate RDMA performance isolation solutions. No existing solution passes the test suite, including RNICs from three major vendors: NVIDIA (which has acknowledged the results), Chelsio, and Intel.
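
To make the idea of "passing" an isolation test concrete, here is a minimal sketch of one possible pass/fail check. It is not the criterion used by the paper's test suite: the function names, the fair-share fraction, the tolerance, and the throughput numbers are all hypothetical and chosen only for illustration. The check compares a victim tenant's throughput when running alone with its throughput while another tenant shares the RNIC.

```python
def degradation(alone_gbps: float, shared_gbps: float) -> float:
    """Fraction of the victim's standalone throughput lost under sharing."""
    if alone_gbps <= 0:
        raise ValueError("standalone throughput must be positive")
    return max(0.0, (alone_gbps - shared_gbps) / alone_gbps)


def passes_isolation(alone_gbps: float, shared_gbps: float,
                     fair_share: float = 0.5, tolerance: float = 0.1) -> bool:
    """Hypothetical criterion: the victim keeps at least its fair share
    of standalone throughput, minus a small tolerance."""
    return shared_gbps >= alone_gbps * fair_share * (1.0 - tolerance)


# Made-up measurements: the victim reaches 100 Gbps alone.
well_behaved = passes_isolation(100.0, 48.0)   # neighbor only uses bandwidth
interfered = passes_isolation(100.0, 10.0)     # neighbor exhausts a shared resource
loss = degradation(100.0, 10.0)
```

Under these assumed thresholds, the first case counts as isolated and the second as a violation. A real suite would repeat such a comparison for each kind of resource a neighbor can exhaust.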

Key findings

Microarchitecture resources
- NIC caches
  - Control verbs can cause excessive cache misses and a drastic performance reduction.
  - Data verbs contend for different RNIC caches.
  - Accessing a wide range of objects (QPs, CQs, MRs) causes ICM cache misses.
- Processing units
  - Performance interference between different data verbs depends on the complexity of verbs.
  - Error handling can stall RNIC processing units and hang all the applications.
  - The impact of control verbs is restricted by their kernel involvement.
- PCIe bandwidth
  - PCIe bandwidth becomes the bottleneck only when the request size falls within a specific range.
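
The NIC-cache findings above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of any real RNIC: the LRU replacement policy, the cache capacity of 64 entries, the object counts, and the function name `victim_miss_rate` are all hypothetical. It shows how a neighbor that cycles over many objects (QPs, CQs, MRs) can evict a victim's small working set from a shared cache.

```python
from collections import OrderedDict


def victim_miss_rate(victim_objs: int, attacker_objs: int,
                     attacker_per_victim: int, capacity: int,
                     rounds: int) -> float:
    """Replay interleaved accesses through one shared LRU cache and
    return the fraction of the victim's accesses that miss."""
    cache = OrderedDict()

    def touch(key) -> bool:
        """Access `key`; return True on a hit, evicting the LRU entry on overflow."""
        if key in cache:
            cache.move_to_end(key)
            return True
        cache[key] = True
        if len(cache) > capacity:
            cache.popitem(last=False)
        return False

    misses = 0
    accesses = 0
    next_attacker = 0
    for _ in range(rounds):
        for v in range(victim_objs):
            accesses += 1
            if not touch(("victim", v)):
                misses += 1
            # The attacker cycles over a wide range of its own objects.
            for _burst in range(attacker_per_victim):
                touch(("attacker", next_attacker % attacker_objs))
                next_attacker += 1
    return misses / accesses


# Victim alone: 16 objects fit in a 64-entry cache, so only cold misses occur.
alone = victim_miss_rate(16, 1024, 0, capacity=64, rounds=100)
# Shared: 8 attacker accesses per victim access push the victim's entries out.
shared = victim_miss_rate(16, 1024, 8, capacity=64, rounds=100)
```

With these assumed parameters the victim only takes cold misses when alone, but misses on every access once the wide-range neighbor is active, even though the victim's own access pattern did not change.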
