Related papers: SynCron: Efficient Synchronization Support for Nea…

CODA: Enabling Co-location of Computation and Data for Near-Data Processing

Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory…

Hardware Architecture · Computer Science 2018-12-05 Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel H. Loh

Accelerating Irregular Applications via Efficient Synchronization and Data Access Techniques

Irregular applications comprise an increasingly important workload domain for many fields, including bioinformatics, chemistry, physics, social sciences and machine learning. Therefore, achieving high performance and energy efficiency in…

Hardware Architecture · Computer Science 2022-11-16 Christina Giannoula

NearPM: A Near-Data Processing System for Storage-Class Applications

Persistent Memory (PM) technologies enable program recovery to a consistent state in case of failure. To ensure this crash-consistent behavior, programs need to enforce persist ordering by employing mechanisms, such as logging and…

Computational Engineering, Finance, and Science · Computer Science 2023-04-03 Yasas Seneviratne, Korakit Seemakhupt, Sihang Liu, Samira Khan

An Asynchronous Multi-core Accelerator for SNN inference

Spiking Neural Networks (SNNs) are extensively utilized in brain-inspired computing and neuroscience research. To enhance the speed and energy efficiency of SNNs, several many-core accelerators have been developed. However, maintaining the…

Neural and Evolutionary Computing · Computer Science 2024-07-31 Zhuo Chen, De Ma, Xiaofei Jin, Qinghui Xing, Ouwen Jin, Xin Du, Shuibing He, Gang Pan

Proxics: an efficient programming model for far memory accelerators

The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such…

Operating Systems · Computer Science 2026-04-21 Zikai Liu, Niels Pressel, Jasmin Schult, Roman Meier, Pengcheng Xu, Timothy Roscoe

Near Data Acceleration with Concurrent Host Access

Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host…

Hardware Architecture · Computer Science 2020-12-02 Benjamin Y. Cho, Yongkee Kwon, Sangkug Lym, Mattan Erez

ASAP: Asynchronous Approximate Data-Parallel Computation

Emerging workloads, such as graph processing and machine learning, are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-28 Asim Kadav, Erik Kruus

MRCN: Enhanced Coherence Mechanism for Near Memory Processing Architectures

In Near Memory Processing (NMP), processing elements (PEs) are placed near the 3D memory, reducing unnecessary data transfers between the CPU and the memory. However, as the CPUs and the PEs of the NMP use a shared memory space, maintaining…

Hardware Architecture · Computer Science 2023-12-13 Amit Kumar Kabat, Shubhang Pandey, TG Venkatesh

SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors

Simultaneous multithreading processors improve throughput over single-threaded processors thanks to sharing internal core resources among instructions from distinct threads. However, resource sharing introduces inter-thread interference…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-20 Marta Navarro, Josué Feliu, Salvador Petit, María E. Gómez, Julio Sahuquillo

Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders

Emerging Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. While its CXL.mem protocol provides minimal latency overhead through an optimized protocol stack, frequent CXL memory…

Hardware Architecture · Computer Science 2024-10-07 Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyojin Sung, Euicheol Lim, Gwangsun Kim

A Survey of Near-Data Processing Architectures for Neural Networks

Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von-Neumann architecture. As data movement operations and energy consumption become key…

Hardware Architecture · Computer Science 2021-12-24 Mehdi Hassanpour, Marc Riera, Antonio González

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the internet-of-things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency…

Hardware Architecture · Computer Science 2020-04-15 Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Synch: A framework for concurrent data-structures and benchmarks

The recent advancements in multicore machines highlight the need to simplify concurrent programming in order to leverage their computational power. One way to achieve this is by designing efficient concurrent data structures (e.g. stacks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-31 Nikolaos D. Kallimanis

NDPage: Efficient Address Translation for Near-Data Processing Architectures via Tailored Page Table

Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall problem for data-intensive applications. Practical implementation of NDP architectures calls for system support for better programmability,…

Hardware Architecture · Computer Science 2025-02-21 Qingcai Jiang, Buxin Tu, Hong An

Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability

In modern data centers, energy usage represents one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at…

Performance · Computer Science 2017-09-05 Stefano Conoci, Pierangelo Di Sanzo, Bruno Ciciani, Francesco Quaglia

CaMDN: Enhancing Cache Efficiency for Multi-tenant DNNs on Integrated NPUs

With the rapid development of DNN applications, multi-tenant execution, where multiple DNNs are co-located on a single SoC, is becoming a prevailing trend. Although many methods are proposed in prior works to improve multi-tenant…

Hardware Architecture · Computer Science 2025-05-15 Tianhao Cai, Liang Wang, Limin Xiao, Meng Han, Zeyu Wang, Lin Sun, Xiaojian Liao

An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses

The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to add more compute units and on-chip buffers to solve the memory wall…

Hardware Architecture · Computer Science 2023-10-30 Bahareh Khabbazan, Marc Riera, Antonio González

Heterogeneous Multi-core Array-based DNN Accelerator

In this article, we investigate the impact of the architectural parameters of array-based DNN accelerators on the accelerator's energy consumption and performance across a wide variety of network topologies. For this purpose, we have developed a tool…

Hardware Architecture · Computer Science 2022-06-28 Mohammad Ali Maleki, Mehdi Kamal, Ali Afzali-Kusha

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-06 Xinwei Qiang, Yue Guan, Zhengding Hu, Keren Zhou, Yufei Ding, Adnan Aziz

A Novel Process Mapping Strategy in Clustered Environments

Nowadays, the number of available processing cores within the computing nodes used in recent clustered environments is growing at a rapid rate. Despite this trend, the number of available network interfaces in such computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-13 Mohsen Soryani, Morteza Analoui, Ghobad Zarrinchian