Related papers: Asynchronous Memory Access Unit: Exploiting Massiv…

Asynchronous Memory Access Unit for General Purpose Processors

In future data centers, applications will make heavy use of far memory (including disaggregated memory pools and NVM). The access latency of far memory is more widely distributed than that of local memory accesses. This makes the efficiency…

Hardware Architecture · Computer Science 2021-12-28 Luming Wang , Xu Zhang , Tianyue Lu , Mingyu Chen

The Granularity Gap Problem: A Hurdle for Applying Approximate Memory to Complex Data Layout

The main memory access latency has not much improved for more than two decades while the CPU performance had been exponentially increasing until recently. Approximate memory is a technique to reduce the DRAM access latency in return of…

Emerging Technologies · Computer Science 2021-01-27 Soramichi Akiyama , Ryota Shioya

Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories

Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a…

Hardware Architecture · Computer Science 2019-08-22 Shihao Song , Anup Das , Onur Mutlu , Nagarajan Kandasamy

Understanding and Improving the Latency of DRAM-Based Memory Systems

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic…

Hardware Architecture · Computer Science 2017-12-25 Kevin K. Chang

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations

Remote memory access (RMA) is an emerging high-performance programming model that uses RDMA hardware directly. Yet, accessing remote memories cannot invoke activities at the target which complicates implementation and limits performance of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-20 Maciej Besta , Torsten Hoefler

Cimple: Instruction and Memory Level Parallelism

Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for…

Programming Languages · Computer Science 2018-07-05 Vladimir Kiriansky , Haoran Xu , Martin Rinard , Saman Amarasinghe

PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System

The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and…

Hardware Architecture · Computer Science 2026-02-13 Lian Liu , Shixin Zhao , Yutian Zhou , Yintao He , Mengdi Wang , Yinhe Han , Ying Wang

Design Space Exploration of Algorithmic Multi-Port Memories in High-Performance Application-Specific Accelerators

Memory load/store instructions consume an important part in execution time and energy consumption in domain-specific accelerators. For designing highly parallel systems, available parallelism at each granularity is extracted from the…

Hardware Architecture · Computer Science 2022-09-07 Khushal Sethi

Emulating a large memory with a collection of small ones

Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many…

Hardware Architecture · Computer Science 2015-11-17 James Hanlon

CoroAMU: Unleashing Memory-Driven Coroutines through Latency-Aware Decoupled Operations

Modern data-intensive applications face memory latency challenges exacerbated by disaggregated memory systems. Recent work shows that coroutines are promising in effectively interleaving tasks and hiding memory latency, but they struggle to…

Hardware Architecture · Computer Science 2025-11-20 Zhuolun Jiang , Songyue Wang , Xiaokun Pei , Tianyue Lu , Mingyu Chen

Towards Efficient OpenMP Strategies for Non-Uniform Architectures

Parallel processing is considered as todays and future trend for improving performance of computers. Computing devices ranging from small embedded systems to big clusters of computers rely on parallelizing applications to reduce execution…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-27 Oussama Tahan

A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory architectures alleviate this bottleneck by providing the memory with computing…

Hardware Architecture · Computer Science 2023-03-28 Safaa Diab , Amir Nassereldine , Mohammed Alser , Juan Gómez-Luna , Onur Mutlu , Izzat El Hajj

Algorithms in the Ultra-Wide Word Model

The effective use of parallel computing resources to speed up algorithms in current multi-core parallel architectures remains a difficult challenge, with ease of programming playing a key role in the eventual success of various parallel…

Data Structures and Algorithms · Computer Science 2014-12-09 Arash Farzan , Alejandro López-Ortiz , Patrick K. Nicholson , Alejandro Salinger

Continual Learning Approach for Improving the Data and Computation Mapping in Near-Memory Processing System

The resurgence of near-memory processing (NMP) with the advent of big data has shifted the computation paradigm from processor-centric to memory-centric computing. To meet the bandwidth and capacity demands of memory-centric computing, 3D…

Hardware Architecture · Computer Science 2021-04-29 Pritam Majumder , Jiayi Huang , Sungkeun Kim , Abdullah Muzahid , Dylan Siegers , Chia-Che Tsai , Eun Jung Kim

AMA: Adaptive Memory via Multi-Agent Collaboration

The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has…

Artificial Intelligence · Computer Science 2026-04-16 Weiquan Huang , Zixuan Wang , Hehai Lin , Sudong Wang , Bo Xu , Qian Li , Beier Zhu , Linyi Yang , Chengwei Qin

In-memory Associative Processors: Tutorial, Potential, and Challenges

In-memory computing is an emerging computing paradigm that overcomes the limitations of exiting Von-Neumann computing architectures such as the memory-wall bottleneck. In such paradigm, the computations are performed directly on the data…

Emerging Technologies · Computer Science 2022-04-14 Mohammed E. Fouda , Hasan Erdem Yantir , Ahmed M. Eltawil , Fadi Kurdahi

A Customized Memory-aware Architecture for Biological Sequence Alignment

Sequence alignment is a fundamental process in computational biology which identifies regions of similarity in biological sequences. With the exponential growth in the volume of data in bioinformatics databases, the time, processing power,…

Hardware Architecture · Computer Science 2025-07-31 Nasrin Akbari , Mehdi Modarressi , Alireza Khadem

Fault Tolerance for Remote Memory Access Programming Models

Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-20 Maciej Besta , Torsten Hoefler

A Hybrid Parallelization of AIM for Multi-Core Clusters: Implementation Details and Benchmark Results on Ranger

This paper presents implementation details and empirical results for a hybrid message passing and shared memory paralleliziation of the adaptive integral method (AIM). AIM is implemented on a (near) petaflop supercomputing cluster of…

Computational Engineering, Finance, and Science · Computer Science 2010-10-08 Fangzhou Wei , Ali E. Yılmaz

pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One…

Hardware Architecture · Computer Science 2025-01-24 João Dinis Ferreira , Gabriel Falcao , Juan Gómez-Luna , Mohammed Alser , Lois Orosa , Mohammad Sadrosadati , Jeremie S. Kim , Geraldo F. Oliveira , Taha Shahroodi , Anant Nori , Onur Mutlu