Related papers: Vector operations for accelerating expensive Bayes…

SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of…

Computation · Statistics 2015-03-02 Alireza S. Mahani, Mansour T. A. Sharabiani

Efficient method for parallel computation of geodesic transformation on CPU

This paper introduces a fast Central Processing Unit (CPU) implementation of geodesic morphological operations using stream processing. In contrast to the current state of the art, which focuses on achieving insensitivity to the filter sizes…

Performance · Computer Science 2019-12-02 Danijel Žlaus, Domen Mongus

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

SIMD-X: Programming and Processing of Graph Algorithms on GPUs

With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-12 Hang Liu, H. Howie Huang

Performance of SSE and AVX Instruction Sets

SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured by Intel and AMD. This SIMD programming allows parallel…

High Energy Physics - Lattice · Physics 2013-11-05 Hwancheol Jeong, Sunghoon Kim, Weonjong Lee, Seok-Ho Myung

Leveraging SIMD for Accelerating Large-number Arithmetic

Large-number arithmetic, widely used in scientific computing and cryptography, has seen limited adoption of single instruction, multiple data (SIMD) parallelism on modern CPUs due to the inherent dependencies in traditional algorithms. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-27 Subhrajit Das, Abhishek Bichhawat, Yuvraj Patel

General Matrix-Matrix Multiplication Using SIMD features of the PIII

Generalised matrix-matrix multiplication forms the kernel of many mathematical algorithms. A faster matrix-matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices…

Performance · Computer Science 2019-12-11 Douglas Aberdeen, Jonathan Baxter

Large-Scale Geospatial Processing on Multi-Core and Many-Core Processors: Evaluations on CPUs, GPUs and MICs

Geospatial processing, such as queries based on point-to-polyline shortest distance and point-in-polygon tests, is fundamental to many scientific and engineering applications, including post-processing large-scale environmental and climate…

Databases · Computer Science 2014-03-05 Jianting Zhang, Simin You

Speculative Parallel Evaluation Of Classification Trees On GPGPU Compute Engines

We examine the problem of optimizing classification tree evaluation for on-line and real-time applications by using GPUs. Looking at trees with continuous attributes often used in image segmentation, we first put the existing algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-08 Jason Spencer

High performance SIMD modular arithmetic for polynomial evaluation

Two essential problems in Computer Algebra, namely polynomial factorization and polynomial greatest common divisor computation, can be efficiently solved thanks to multiple polynomial evaluations in two variables using modular arithmetic.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-27 Pierre Fortin, Ambroise Fleury, François Lemaire, Michael Monagan

Scanning HTML at Tens of Gigabytes per Second on ARM Processors

Modern processors have instructions to process 16 bytes or more at once. These instructions are called SIMD, for single instruction, multiple data. Recent advances have leveraged SIMD instructions to accelerate parsing of common Internet…

Data Structures and Algorithms · Computer Science 2025-06-05 Daniel Lemire

Parallel Prefix Sum with SIMD

The prefix sum operation is a useful primitive with a broad range of applications. For database systems, it is a building block of many important operators including join, sort and filter queries. In this paper, we study different methods…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Wangda Zhang, Yanbin Wang, Kenneth A. Ross

SIMD-ified R-tree Query Processing and Optimization

The introduction of Single Instruction Multiple Data (SIMD) instructions in mainstream CPUs has enabled modern database engines to leverage data parallelism by performing more computation with a single instruction, resulting in a reduced…

Databases · Computer Science 2023-12-27 Yeasir Rayhan, Walid G. Aref

A hybrid algorithm for parallel molecular dynamics simulations

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization…

Materials Science · Physics 2017-09-13 Chris M. Mangiardi, Ralf Meyer

SIMD$^2$: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless,…

Hardware Architecture · Computer Science 2022-09-02 Yunan Zhang, Po-An Tsai, Hung-Wei Tseng

A General SIMD-based Approach to Accelerating Compression Algorithms

Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide us an opportunity for achieving better compression performance.…

Information Retrieval · Computer Science 2015-04-15 Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization

Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations.…

Mathematical Software · Computer Science 2026-05-08 Tomonori Kouya

Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-01-30 Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein

SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM

Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable the full adoption of processing-using-DRAM, it is necessary to provide support for more complex…

Hardware Architecture · Computer Science 2020-12-23 Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

Parallel Scan on Ascend AI Accelerators

We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-05 Bartłomiej Wróblewski, Gioele Gottardo, Anastasios Zouzias