Related papers: Parallel Prefix Sum with SIMD

Efficient Additions and Montgomery Reductions of Large Integers for SIMD

This paper presents efficient algorithms, designed to leverage SIMD for performing Montgomery reductions and additions on integers larger than 512 bits. The existing algorithms encounter inefficiencies when parallelized using SIMD due to…

Cryptography and Security · Computer Science 2023-09-01 Pengchang Ren, Reiji Suda, Vorapong Suppakitpaisarn

SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM

Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable the full adoption of processing-using-DRAM, it is necessary to provide support for more complex…

Hardware Architecture · Computer Science 2020-12-23 Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

Vector operations for accelerating expensive Bayesian computations -- a tutorial guide

Many applications in Bayesian statistics are extremely computationally intensive. However, they are often inherently parallel, making them prime targets for modern massively parallel processors. Multi-core and distributed computing is…

Computation · Statistics 2021-05-10 David J. Warne, Scott A. Sisson, Christopher Drovandi

SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of…

Computation · Statistics 2015-03-02 Alireza S. Mahani, Mansour T. A. Sharabiani

SIMD-ified R-tree Query Processing and Optimization

The introduction of Single Instruction Multiple Data (SIMD) instructions in mainstream CPUs has enabled modern database engines to leverage data parallelism by performing more computation with a single instruction, resulting in a reduced…

Databases · Computer Science 2023-12-27 Yeasir Rayhan, Walid G. Aref

Parallel Dynamics Computation using Prefix Sum Operations

We propose a new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans). We re-investigate the well-known recursive Newton-Euler formulation of robot dynamics and show that…

Robotics · Computer Science 2016-09-16 Yajue Yang, Yuanqing Wu, Jia Pan

Parallel Breadth-First Search on Distributed Memory Systems

Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-10-17 Aydin Buluc, Kamesh Madduri

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

Processing-in-Memory (PIM) enhances memory with computational capabilities, potentially solving energy and latency issues associated with data transfer between memory and processors. However, managing concurrent computation and data flow…

Hardware Architecture · Computer Science 2025-05-09 Ahmed Mamdouh, Haoran Geng, Michael Niemier, Xiaobo Sharon Hu, Dayane Reis

Aggregating over Dominated Points by Sorting, Scanning, Zip and Flat Maps

Prefix aggregation operation (also called scan), and its particular case, prefix summation, is an important parallel primitive and enjoys a lot of attention in the research literature. It is also used in many algorithms as one of the steps.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-29 Jacek Sroka, Jerzy Tyszkiewicz

Practical Trade-Offs for the Prefix-Sum Problem

Given an integer array A, the prefix-sum problem is to answer sum(i) queries that return the sum of the elements in A[0..i], knowing that the integers in A can be changed. It is a classic problem in data structure design with a wide range…

Data Structures and Algorithms · Computer Science 2022-02-08 Giulio Ermanno Pibiri, Rossano Venturini

Scanning HTML at Tens of Gigabytes per Second on ARM Processors

Modern processors have instructions to process 16 bytes or more at once. These instructions are called SIMD, for single instruction, multiple data. Recent advances have leveraged SIMD instructions to accelerate parsing of common Internet…

Data Structures and Algorithms · Computer Science 2025-06-05 Daniel Lemire

A General SIMD-based Approach to Accelerating Compression Algorithms

Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide us an opportunity for achieving better compression performance.…

Information Retrieval · Computer Science 2015-04-15 Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

Efficient Computation of Positional Population Counts Using SIMD Instructions

In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single…

Data Structures and Algorithms · Computer Science 2021-08-19 Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire

Parallel Combining: Benefits of Explicit Synchronization

Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-14 Vitaly Aksenov, Petr Kuznetsov, Anatoly Shalyto

Concurrent Processing Memory

A theoretical memory with limited processing power and internal connectivity at each element is proposed. This memory carries out parallel processing within itself to solve generic array problems. The applicability of this in-memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-09-28 Chengpu Wang

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference.…

Data Structures and Algorithms · Computer Science 2020-06-26 Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, Yihan Sun

A flexible algorithm for calculating pair interactions on SIMD architectures

Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have…

Computational Physics · Physics 2015-06-16 Szilárd Páll, Berk Hess

SIMD Compression and the Intersection of Sorted Integers

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our…

Information Retrieval · Computer Science 2020-04-22 Daniel Lemire, Leonid Boytsov, Nathan Kurz

Membrane: Accelerating Database Analytics with Bank-Level DRAM-PIM Filtering

In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to the Von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable solution to…

Hardware Architecture · Computer Science 2025-04-10 Akhil Shekar, Kevin Gaffney, Martin Prammer, Khyati Kiyawat, Lingxi Wu, Helena Caminal, Zhenxing Fan, Yimin Gao, Ashish Venkat, José F. Martínez, Jignesh Patel, Kevin Skadron