Direct access, high-performance memory disaggregation with DirectCXL

Metadata

Presented at USENIX ATC 2022.

Authors: Donghyun Gouk, Sangwon Lee, Miryeong Kwon, Myoungsoo Jung (KAIST)

Understanding the paper

TL;DRs

The first work to bring CXL 2.0 into a real system and analyze the performance characteristics of a CXL-enabled disaggregated memory design.

Existing works

  • Existing approaches fall into two categories, depending on how they manage data between a host and the memory server(s):

    • Page-based: use virtual memory techniques so that applications can use disaggregated memory without code changes; intercept paging requests on a page fault; swap the data to a remote memory node instead of the underlying storage.

    • Object-based: manage disaggregated memory on a remote node through their own database (e.g., a key-value store); intervene directly in RDMA data transfers; require significant source-level modifications and interface changes.

  • All existing approaches must move data from remote memory to host memory over RDMA (or a similar fine-grained network interface); this data movement and its accompanying operations (e.g., page cache management) introduce redundant memory copies and software fabric intervention.
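
The difference between the two approaches can be sketched with a toy model. This is an illustration only, not code from the paper: `RemoteMemoryNode`, `PageBasedClient`, and `ObjectBasedClient` are hypothetical names, and the "network" is a plain in-process copy. It shows that the page-based client needs no special API but moves a whole page per fault, while the object-based client moves only the requested object but requires an explicit `get()` call in the application.

```python
PAGE_SIZE = 4096  # bytes per page (a common default, assumed here)


class RemoteMemoryNode:
    """Toy stand-in for a memory server; every fetch models a network copy."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.bytes_copied = 0  # total bytes moved to the host so far

    def fetch(self, offset, length):
        self.bytes_copied += length
        return bytes(self.data[offset:offset + length])


class PageBasedClient:
    """Transparent to the application, but copies a full page per fault."""

    def __init__(self, remote):
        self.remote = remote
        self.local_pages = {}  # page number -> local copy of that page

    def load_byte(self, addr):
        page_no, page_off = divmod(addr, PAGE_SIZE)
        if page_no not in self.local_pages:  # models a page fault
            self.local_pages[page_no] = self.remote.fetch(
                page_no * PAGE_SIZE, PAGE_SIZE
            )
        return self.local_pages[page_no][page_off]


class ObjectBasedClient:
    """The application calls get() explicitly; only the object is copied."""

    def __init__(self, remote, index):
        self.remote = remote
        self.index = index  # key -> (offset, length) on the remote node

    def get(self, key):
        offset, length = self.index[key]
        return self.remote.fetch(offset, length)


remote = RemoteMemoryNode(4 * PAGE_SIZE)
remote.data[5000:5005] = b"hello"

paged = PageBasedClient(remote)
first = paged.load_byte(5000)   # faults and copies one full page
second = paged.load_byte(5001)  # served from the local copy
page_bytes = remote.bytes_copied

remote.bytes_copied = 0
objects = ObjectBasedClient(remote, {"greeting": (5000, 5)})
value = objects.get("greeting")
object_bytes = remote.bytes_copied
```

In both cases the data is copied into host memory before the application can use it, which is the overhead the paper targets.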

Contributions

  • Disaggregate memory over CXL and integrate the disaggregated memory into processor-side system memory.

    • Implement a CXL controller that employs multiple DRAM modules on the remote side.

    • Implement a CXL software runtime that lets users access the underlying disaggregated memory resources with plain load/store instructions.

  • Prototype DirectCXL using multiple customized memory add-in cards, 16nm FPGA-based processor nodes, a switch, and a PCIe backplane.
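
As a rough picture of what load/store access means for an application, the sketch below reads and writes a mapped memory region in place, with no send/receive call and no staging buffer. It is not the DirectCXL runtime: the region is an anonymous mapping used as a stand-in for CXL-attached memory, the helper names `store_u64` and `load_u64` are hypothetical, and Python only approximates the semantics of a single CPU load or store.

```python
import mmap
import struct

REGION_SIZE = 4096  # one page; a real disaggregated region would be far larger

# Anonymous mapping used as a stand-in. On a real system this would be a
# mapping of the CXL-attached memory range exposed by the software runtime.
region = mmap.mmap(-1, REGION_SIZE)


def store_u64(mem, offset, value):
    """Write a 64-bit little-endian integer directly into the mapped region."""
    struct.pack_into("<Q", mem, offset, value)


def load_u64(mem, offset):
    """Read a 64-bit little-endian integer directly from the mapped region."""
    return struct.unpack_from("<Q", mem, offset)[0]


store_u64(region, 128, 0xDEADBEEF)
result = load_u64(region, 128)
untouched = load_u64(region, 0)  # anonymous mappings start zero-filled
region.close()
```

The point of the design is that the application code looks the same whether the mapped range is backed by local DRAM or by remote memory.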

CXL vs. RDMA

  • RDMA-based: all DRAM modules and their interfaces are designed as passive peripherals, so the remote side needs its own computing resources to control them.

  • CXL-based: allows the host computing resources to directly access the underlying memory through PCIe buses.
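
One way to picture direct access is that remote memory appears as additional ranges of the host's physical address space, so an ordinary memory address is enough to reach it. The sketch below is a simplified illustration under that assumption: the layout in `ADDRESS_MAP`, the device names, and the `route` helper are all hypothetical and are not taken from the paper or the CXL specification.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class MemoryRange:
    name: str  # label for the backing memory (hypothetical)
    base: int  # first host physical address of the range
    size: int  # length of the range in bytes

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


# Hypothetical layout: local DRAM first, then two CXL memory devices.
ADDRESS_MAP = [
    MemoryRange("local-dram", 0 * GIB, 4 * GIB),
    MemoryRange("cxl-device-0", 4 * GIB, 8 * GIB),
    MemoryRange("cxl-device-1", 12 * GIB, 8 * GIB),
]


def route(addr):
    """Return (backing memory name, offset inside it) for a host address."""
    for mem_range in ADDRESS_MAP:
        if mem_range.contains(addr):
            return mem_range.name, addr - mem_range.base
    raise ValueError(f"address {addr:#x} is not mapped")


local = route(1 * GIB)        # lands in local DRAM
remote = route(5 * GIB + 64)  # lands in the first CXL device
```

Under this model no remote-side software has to service the request; the address alone decides which memory responds.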

Performance evaluation

  • Compared to RDMA-based memory disaggregation, DirectCXL shows 6.2x lower latency and 3x better performance.
