Related papers: RowClone: Accelerating Data Movement and Initializ…

The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR

In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to…

Hardware Architecture · Computer Science 2016-11-01 Vivek Seshadri, Onur Mutlu

Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems

In most modern systems, the memory subsystem is managed and accessed at multiple different granularities at various resources. We observe that such multi-granularity management results in significant inefficiency in the memory subsystem.…

Hardware Architecture · Computer Science 2016-05-23 Vivek Seshadri

NOM: Network-On-Memory for Inter-Bank Data Transfer in Highly-Banked Memories

Data copy is a widely-used memory operation in many programs and operating system services. In conventional computers, data copy is often carried out by two separate read and write transactions that pass data back and forth between the DRAM…

Hardware Architecture · Computer Science 2020-04-27 Seyyed Hossein SeyyedAghaei Rezaei, Mehdi Modarressi, Rachata Ausavarungnirun, Mohammad Sadrosadati, Onur Mutlu, Masoud Daneshtalab

Relational Memory: Native In-Memory Accesses on Rows and Columns

Analytical database systems are typically designed to use a column-first data layout to access only the desired fields. On the other hand, storing data row-first works great for accessing, inserting, or updating entire rows. Transforming…

Databases · Computer Science 2022-02-08 Shahin Roozkhosh, Denis Hoornaert, Ju Hyoung Mun, Tarikul Islam Papon, Ahmed Sanaullah, Ulrich Drepper, Renato Mancuso, Manos Athanassoulis

CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge

Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited storage, weight, and power of edge devices make it difficult to deploy LLM-powered applications.…

Hardware Architecture · Computer Science 2025-06-04 Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, Chengzhong Xu

NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs

Spawning duplicate requests, called cloning, is a powerful technique to reduce tail latency by masking service-time variability. However, traditional client-based cloning is static and harmful to performance under high load, while a recent…

Networking and Internet Architecture · Computer Science 2023-07-26 Gyuyeong Kim

Cyclone: High Availability for Persistent Key Value Stores

Persistent key value stores are an important component of many distributed data serving solutions with innovations targeted at taking advantage of growing flash speeds. Unfortunately their performance is hampered by the need to maintain and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-21 Amitabha Roy, Subramanya R. Dulloor

RoMe: Row Granularity Access Memory System for Large Language Models

Modern HBM-based memory systems have evolved over generations while retaining cache line granularity accesses. Preserving this fine granularity necessitated the introduction of bank groups and pseudo channels. These structures expand timing…

Hardware Architecture · Computer Science 2025-12-02 Hwayong Nam, Seungmin Baek, Jumin Kim, Michael Jaemin Kim, Jung Ho Ahn

Fast Updates on Read-Optimized Databases Using Multi-Core CPUs

Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process…

Databases · Computer Science 2015-03-19 Jens Krueger, Changkyu Kim, Martin Grund, Nadathur Satish, David Schwalb, Jatin Chhugani, Hasso Plattner, Pradeep Dubey, Alexander Zeier

PiDRAM: An FPGA-based Framework for End-to-end Evaluation of Processing-in-DRAM Techniques

DRAM-based main memory is used in nearly all computing systems as a major component. One way of overcoming the main memory bottleneck is to move computation near memory, a paradigm known as processing-in-memory (PiM). Recent PiM techniques…

Hardware Architecture · Computer Science 2022-06-02 Ataberk Olgun, Juan Gomez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oguz Ergin, Onur Mutlu

Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation through Large Virtual Memory and Global Data Structures

We demonstrate that general-purpose memory allocation involving many threads on many cores can be done with high performance, multicore scalability, and low memory consumption. For this purpose, we have designed and implemented scalloc, a…

Programming Languages · Computer Science 2015-08-26 Martin Aigner, Christoph M. Kirsch, Michael Lippautz, Ana Sokolova

PULSAR: Simultaneous Many-Row Activation for Reliable and High-Performance Computing in Off-the-Shelf DRAM Chips

Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where…

Hardware Architecture · Computer Science 2024-03-19 Ismail Emir Yuksel, Yahya Can Tugrul, F. Nisa Bostanci, Abdullah Giray Yaglikci, Ataberk Olgun, Geraldo F. Oliveira, Melina Soysal, Haocong Luo, Juan Gomez Luna, Mohammad Sadrosadati, Onur Mutlu

Towards Reconfigurable Linearizable Reads

Linearizable datastores are desirable because they provide users with the illusion that the datastore is run on a single machine that performs client operations one at a time. To reduce the performance cost of providing this illusion, many…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-09 Myles Thiessen, Aleksey Panas, Guy Khazma, Eyal de Lara

Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency

This paper summarizes the idea of ChargeCache, which was published in HPCA 2016 [51], and examines the work's significance and future potential. DRAM latency continues to be a critical bottleneck for system performance. In this work, we…

Hardware Architecture · Computer Science 2018-05-11 Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu

Memory-Centric Computing: Recent Advances in Processing-in-DRAM

Memory-centric computing aims to enable computation capability in and near all places where data is generated and stored. As such, it can greatly reduce the large negative performance and energy impact of data access and data movement, by…

Hardware Architecture · Computer Science 2024-12-30 Onur Mutlu, Ataberk Olgun, Geraldo F. Oliveira, Ismail Emir Yuksel

NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators

Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the…

Hardware Architecture · Computer Science 2022-11-11 Aditya Manglik, Minesh Patel, Haiyu Mao, Behzad Salami, Jisung Park, Lois Orosa, Onur Mutlu

Supercharging Distributed Computing Environments For High Performance Data Engineering

The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-20 Niranda Perera, Kaiying Shan, Supun Kamburugamuwe, Thejaka Amila Kanewela, Chathura Widanage, Arup Sarker, Mills Staylor, Tianle Zhong, Vibhatha Abeykoon, Geoffrey Fox

FALKON: An Optimal Large Scale Kernel Method

Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited…

Machine Learning · Statistics 2018-02-01 Alessandro Rudi, Luigi Carratino, Lorenzo Rosasco

YOLoC: DeploY Large-Scale Neural Network by ROM-based Computing-in-Memory using ResiduaL Branch on a Chip

Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing…

Hardware Architecture · Computer Science 2022-08-18 Yiming Chen, Guodong Yin, Zhanhong Tan, Mingyen Lee, Zekun Yang, Yongpan Liu, Huazhong Yang, Kaisheng Ma, Xueqing Li

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox