Related papers: Offloading to CXL-based Computational Memory

UDON: A case for offloading to general purpose compute on CXL memory

Upcoming CXL-based disaggregated memory devices feature special purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general purpose cores on CXL memory devices, thereby enabling…

Emerging Technologies · Computer Science · 2024-04-04 · Jon Hermes, Josh Minor, Minjun Wu, Adarsh Patil, Eric Van Hensbergen

Modeling the Potential of Message-Free Communication via CXL.mem

Heterogeneous memory technologies are increasingly important instruments in addressing the memory wall in HPC systems. While most are deployed in single node setups, CXL.mem is a technology that implements memories that can be attached to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-10 · Stepan Vanecek, Matthew Turner, Manisha Gajbe, Matthew Wolf, Martin Schulz

CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access

CXL has emerged as a technology for expanding memory for both the host CPU and device accelerators through a load/store interface. Extending memory coherency to the PCIe root complex makes the co-design more flexible in that you can access…

Hardware Architecture · Computer Science · 2023-09-11 · Yiwei Yang

A Programming Model for Disaggregated Memory over CXL

CXL (Compute Express Link) is an emerging open industry-standard interconnect between processing and memory devices that is expected to revolutionize the way systems are designed. It enables cache-coherent, shared memory pools in a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-27 · Gal Assa, Moritz Lumme, Lucas Bürgi, Michal Friedman, Ori Lahav

MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems

MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-17 · Miryeong Kwon, Donghyun Gouk, Hyein Woo, Junhee Kim, Jinwoo Baek, Kyungkuk Nam, Sangyoon Ji, Jiseon Kim, Hanyeoreum Bae, Junhyeok Jang, Hyunwoo You, Junseok Moon, Myoungsoo Jung

Understanding and Optimizing Serverless Workloads in CXL-Enabled Tiered Memory

Recent serverless workloads, such as DL and graph applications, tend to be large-scale and CPU- and memory-intensive, requiring dynamic provisioning of memory and compute resources. Meanwhile, recent solutions seek to design page management strategies…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-09-26 · Yuze Li, Shunyu Yao

PIM or CXL-PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking

Processing-in-memory (PIM) reduces data movement by executing near memory, but our large-scale characterization on real PIM hardware shows that end-to-end performance is often limited by disjoint host and device address spaces that force…

Emerging Technologies · Computer Science · 2025-11-20 · I-Ting Lee, Bao-Kai Wang, Liang-Chi Chen, Wen Sheng Lim, Da-Wei Chang, Yu-Ming Chang, Chieng-Chung Ho

Next-Gen Computing Systems with Compute Express Link: a Comprehensive Survey

Interconnection is crucial for computing systems. However, the current interconnection performance between processors and devices, such as memory devices and accelerators, significantly lags behind their computing performance, severely…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-21 · Chen Chen, Xinkui Zhao, Guanjie Cheng, Yuesheng Xu, Shuiguang Deng, Jianwei Yin

Rethinking Inter-Process Communication with Memory Operation Offloading

As multimodal and AI-driven services exchange hundreds of megabytes per request, existing IPC runtimes spend a growing share of CPU cycles on memory copies. Although both hardware and software mechanisms are exploring memory offloading,…

Operating Systems · Computer Science · 2026-01-13 · Misun Park, Richi Dubey, Yifan Yuan, Nam Sung Kim, Ada Gavrilovska

Memory Sharing with CXL: Hardware and Software Design Approaches

Compute Express Link (CXL) is a rapidly emerging coherent interconnect standard that provides opportunities for memory pooling and sharing. Memory sharing is a well-established software feature that improves memory utilization by avoiding…

Emerging Technologies · Computer Science · 2024-04-05 · Sunita Jain, Nagaradhesh Yeleswarapu, Hasan Al Maruf, Rita Gupta

CXLMemSim: A pure software simulated CXL.mem for performance characterization

CXLMemSim is a fast, lightweight simulation framework that enables performance characterization of memory systems based on CXL.mem, the memory protocol of Compute Express Link (CXL). CXL.mem allows disaggregation and pooling of memory to mitigate memory…

Performance · Computer Science · 2025-06-18 · Yiwei Yang, Brian Zhao, Yusheng Zheng, Pooneh Safayenikoo, Tanvir Ahmed Khan, Andi Quinn

Architectural and System Implications of CXL-enabled Tiered Memory

Memory disaggregation is an emerging technology that decouples memory from traditional memory buses, enabling independent scaling of compute and memory. Compute Express Link (CXL), an open-standard interconnect technology, facilitates…

Hardware Architecture · Computer Science · 2025-03-27 · Yujie Yang, Lingfeng Xiang, Peiran Du, Zhen Lin, Weishu Deng, Ren Wang, Andrey Kudryavtsev, Louis Ko, Hui Lu, Jia Rao

Streamlining CXL Adoption for Hyperscale Efficiency

In our exploration of Composable Memory systems utilizing CXL, we focus on overcoming adoption barriers at Hyperscale, underscored by economic models demonstrating Total Cost of Ownership (TCO). While CXL addresses the pressing memory…

Emerging Technologies · Computer Science · 2024-04-05 · Angelos Arelakis, Nilesh Shah, Yiannis Nikolakopoulos, Dimitrios Palyvos-Giannas

CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

Large language model (LLM) training or inference across multiple nodes introduces significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-08 · Dong Xu, Han Meng, Xinyu Chen, Dengcheng Zhu, Wei Tang, Fei Liu, Liguang Xie, Wu Xiang, Rui Shi, Yue Li, Henry Hu, Hui Zhang, Jianping Jiang, Dong Li

CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST

Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM with high peak-to-average utilization ratios, resulting in overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost. Yet,…

Hardware Architecture · Computer Science · 2026-05-28 · Kaustav Goswami, Maryam Babaie, Hoa Nguyen, Venkatesh Akella, Jason Lowe-Power

CMAX-CAMEL: A Coarse-to-Fine Adaptive, Memory-Efficient, and Low-Power Edge Processor for Contrast Maximization

Contrast maximization (CMAX) is a direct geometric framework for event-based motion estimation, but its iterative warp-and-accumulate pipeline incurs input-dependent computation and frequent memory accesses, challenging real-time, low-power…

Hardware Architecture · Computer Science · 2026-05-26 · Kyeongpil Min, Jongin Choi, Kyeongwon Lee, Woojoo Lee

Exploring and Evaluating Real-world CXL: Use Cases and System Adoption

Compute Express Link (CXL) is emerging as a promising memory interface technology. However, its performance characteristics remain largely unclear due to the limited availability of production hardware. Key questions include: What are the…

Performance · Computer Science · 2025-10-14 · Xi Wang, Jie Liu, Jianbo Wu, Shuangyan Yang, Jie Ren, Bhanu Shankar, Dong Li

Performance Characterizations and Usage Guidelines of Samsung CXL Memory Module Hybrid Prototype

The growing prevalence of data-intensive workloads, such as artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), in-memory databases, and real-time analytics, has exposed limitations in conventional memory…

Hardware Architecture · Computer Science · 2025-03-31 · Jianping Zeng, Shuyi Pei, Da Zhang, Yuchen Zhou, Amir Beygi, Xuebin Yao, Ramdas Kachare, Tong Zhang, Zongwang Li, Marie Nguyen, Rekha Pitchumani, Yang Soek Ki, Changhee Jung

CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach

In the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-08-22 · Yehonatan Fridman, Suprasad Mutalik Desai, Navneet Singh, Thomas Willhalm, Gal Oren

Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables…

Hardware Architecture · Computer Science · 2025-11-04 · Dowon Kim, MinJae Lee, Janghyeon Kim, HyuckSung Kwon, Hyeonggyu Jeong, Sang-Soo Park, Minyong Yoon, Si-Dong Roh, Yongsuk Kwon, Jinin So, Jungwook Choi