Understanding RDMA microarchitecture resources for performance isolation
#RDMA #performance_isolation #test_suite #RNIC #virtual_machine #RDMA_microarchitecture_resource
Meta Info
Presented at NSDI 2023.
Authors: Xinhao Kong, Jingrong Chen (Duke University), Wei Bai (Microsoft), Yechen Xu (SJTU), Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye (Microsoft), Alvin R. Lebeck, Danyang Zhuo (Duke University).
Benchmark suite: https://github.com/host-bench/husky
Understanding the paper
Background
Cloud providers are working towards supporting RDMA in general-purpose guest VMs to benefit third-party workloads (as opposed to first-party workloads such as storage and ML). Thus they must provide performance isolation for tenants sharing the same RNIC.
RDMA brings unique challenges to network performance isolation due to its complex NIC microarchitecture resources (e.g., NIC cache and processing units).
This work examines how these microarchitecture resources affect RDMA performance isolation from a public cloud provider’s perspective. The cloud provider has no knowledge or control of tenants’ RDMA applications, and tenants can consume RNIC microarchitecture resources in arbitrary ways.
Contributions
Study the impact of all types of control verbs and exceptions on RDMA microarchitecture resource consumption.
Present a model that represents how RDMA operations use RNIC resources.
Develop a test suite to evaluate RDMA performance isolation solutions. None of the existing solutions passes the test suite, including those from three major RNIC vendors: NVIDIA (which acknowledged the results), Chelsio, and Intel.
Key findings
Microarchitecture resources
NIC caches
Control verbs can cause excessive cache misses and a drastic performance reduction.
Data verbs contend for different RNIC caches.
Wide range access across many objects (QP, CQ, MR) causes ICM cache misses.
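The wide-range access finding can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it models the on-NIC cache as a plain LRU cache, which is not confirmed by the paper, and the names `LRUCache` and `miss_rate` as well as the capacity of 128 entries are hypothetical. It shows how the miss rate may jump once the number of objects touched (QPs, CQs, MRs) exceeds the cache capacity.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache standing in for an on-NIC context cache.

    Assumption: real RNIC replacement policies are not public; LRU is
    used here only for illustration.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            # Hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            self.hits += 1
        else:
            # Miss: insert the entry, evicting the least recently used one
            # if the cache is over capacity.
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)


def miss_rate(num_objects, capacity, rounds=50):
    """Access num_objects contexts round-robin; return the miss fraction."""
    cache = LRUCache(capacity)
    for _ in range(rounds):
        for obj_id in range(num_objects):
            cache.access(obj_id)
    return cache.misses / (cache.hits + cache.misses)


if __name__ == "__main__":
    # Hypothetical capacity of 128 entries; object counts are illustrative.
    for n in (64, 128, 256):
        print(n, miss_rate(n, capacity=128))
```

In this model, a working set that fits in the cache only pays the initial cold misses, while a round-robin sweep over more objects than the cache holds misses on every access. Real RNICs have several caches with undisclosed policies, so the sketch conveys the trend rather than actual numbers.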
Processing units
Performance interference between different data verbs depends on the complexity of verbs.
Error handling can stall RNIC processing units and hang all the applications.
The impact of control verbs is limited by their kernel involvement.
PCIe bandwidth
PCIe bandwidth becomes the bottleneck only when the request size falls within a specific range.