Related papers: DynaSOAr: A Parallel Memory Allocator for Object-o…

DynaSOAr: Accelerating Single-Method Multiple-Objects Applications on GPUs

Object-oriented programming (OOP) has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. We discovered a broad subset of OOP…

Programming Languages · Computer Science 2019-05-30 Matthias Springer

Memory-Efficient Object-Oriented Programming on GPUs

Object-oriented programming is often regarded as too inefficient for high-performance computing (HPC), despite the fact that many important HPC problems have an inherent object structure. Our goal is to bring efficient, object-oriented…

Programming Languages · Computer Science 2019-08-19 Matthias Springer

Simulation of high-performance memory allocators

For the last thirty years, a large variety of memory allocators have been proposed. Since performance, memory usage and energy consumption of each memory allocator differs, software engineers often face difficult choices in selecting the…

Operating Systems · Computer Science 2024-06-25 José L. Risco-Martín, J. Manuel Colmenar, David Atienza, J. Ignacio Hidalgo

Fast Bitmap Fit: A CPU Cache Line friendly memory allocator for single object allocations

Applications making excessive use of single-object based data structures (such as linked lists, trees, etc.) can see a drop in efficiency over a period of time due to the randomization of nodes in memory. This slowdown is due to the…

Data Structures and Algorithms · Computer Science 2021-10-22 Dhruv Matani, Gaurav Menghani

A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems

For the last thirty years, several Dynamic Memory Managers (DMMs) have been proposed. Such DMMs include first fit, best fit, segregated fit and buddy systems. Since the performance, memory usage and energy consumption of each DMM differs,…

Neural and Evolutionary Computing · Computer Science 2024-07-16 José L. Risco-Martín, David Atienza, J. Manuel Colmenar, Oscar Garnica

Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation through Large Virtual Memory and Global Data Structures

We demonstrate that general-purpose memory allocation involving many threads on many cores can be done with high performance, multicore scalability, and low memory consumption. For this purpose, we have designed and implemented scalloc, a…

Programming Languages · Computer Science 2015-08-26 Martin Aigner, Christoph M. Kirsch, Michael Lippautz, Ana Sokolova

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

Research interest in specialized hardware accelerators for deep neural networks (DNNs) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-11 Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, Minyi Guo

GPUArmor: A Hardware-Software Co-design for Efficient and Scalable Memory Safety on GPUs

Memory safety errors continue to pose a significant threat to current computing systems, and graphics processing units (GPUs) are no exception. A prominent class of memory safety algorithms is allocation-based solutions. The key idea is to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-27 Mohamed Tarek Ibn Ziad, Sana Damani, Mark Stephenson, Stephen W. Keckler, Aamer Jaleel

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou, Zhong Zhang, Maxime Robic, Fanghua Jiang

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented…

Hardware Architecture · Computer Science 2024-12-10 Ayush Gundawar, Euijun Chung, Hyesoon Kim

PIM-malloc: A Fast and Scalable Dynamic Memory Allocator for Processing-In-Memory (PIM) Architectures

The ability to dynamically allocate memory is fundamental in modern programming languages. However, this feature is not adequately supported in current general-purpose PIM devices. To identify key design principles that PIM must consider,…

Hardware Architecture · Computer Science 2026-01-28 Dongjae Lee, Bongjoon Hyun, Youngjin Kwon, Minsoo Rhu

DYNAMO: Dynamic Neutral Atom Multi-programming Optimizer Towards Quantum Operating Systems

As quantum computing advances towards practical applications, quantum operating systems become inevitable, where multi-programming -- the core functionality of operating systems -- enables concurrent execution of multiple quantum programs…

Quantum Physics · Physics 2025-07-08 Wenjie Sun, Xiaoyu Li, Zhigang Wang, Geng Chen, Lianhui Yu, Guowu Yang

A Data-Driven Dynamic Execution Orchestration Architecture

Domain-specific accelerators deliver exceptional performance on their target workloads through fabrication-time orchestrated datapaths. However, such specialized architectures often exhibit performance fragility when exposed to new kernels…

Hardware Architecture · Computer Science 2026-02-20 Zhenyu Bai, Pranav Dangi, Rohan Juneja, Zhaoying Li, Zhanglu Yan, Huiying Lan, Tulika Mitra

SIMD-X: Programming and Processing of Graph Algorithms on GPUs

With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-12 Hang Liu, H. Howie Huang

VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU

Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-06 Zijian He, Adrian Sampson, Yiying Zhang, Zhiyuan Guo

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel…

Neural and Evolutionary Computing · Computer Science 2023-11-09 Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci

Exploring DAOS Interfaces and Performance

Distributed Asynchronous Object Store (DAOS) is a novel software-defined object store leveraging Non-Volatile Memory (NVM) devices, designed for high performance. It provides a number of interfaces for applications to undertake I/O, ranging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-30 Nicolau Manubens, Johann Lombardi, Simon D. Smart, Emanuele Danovaro, Tiago Quintino, Dean Hildebrand, Adrian Jackson

Concurrent Processing Memory

A theoretical memory with limited processing power and internal connectivity at each element is proposed. This memory carries out parallel processing within itself to solve generic array problems. The applicability of this in-memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-09-28 Chengpu Wang

Minotaur: A SIMD-Oriented Synthesizing Superoptimizer

A superoptimizing compiler--one that performs a meaningful search of the program space as part of the optimization process--can find optimization opportunities that are missed by even the best existing optimizing compilers. We created…

Programming Languages · Computer Science 2024-09-04 Zhengyang Liu, Stefan Mada, John Regehr