Related papers: High performance SIMD modular arithmetic for polyn…

Efficient Modular Arithmetic for SIMD Devices

This paper describes several new improvements of modular arithmetic and how to exploit them in order to gain more efficient implementations of commonly used algorithms, especially in cryptographic applications. We further present a new…

Cryptography and Security · Computer Science 2013-10-15 Wilke Trei

Performance of SSE and AVX Instruction Sets

SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured in Intel and AMD. This SIMD programming allows parallel…

High Energy Physics - Lattice · Physics 2013-11-05 Hwancheol Jeong , Sunghoon Kim , Weonjong Lee , Seok-Ho Myung

Vector operations for accelerating expensive Bayesian computations -- a tutorial guide

Many applications in Bayesian statistics are extremely computationally intensive. However, they are often inherently parallel, making them prime targets for modern massively parallel processors. Multi-core and distributed computing is…

Computation · Statistics 2021-05-10 David J. Warne , Scott A. Sisson , Christopher Drovandi

SIMD-X: Programming and Processing of Graph Algorithms on GPUs

With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-12 Hang Liu , H. Howie Huang

Modular SIMD arithmetic in Mathemagix

Modular integer arithmetic occurs in many algorithms for computer algebra, cryptography, and error correcting codes. Although recent microprocessors typically offer a wide range of highly optimized arithmetic functions, modular integer…

Mathematical Software · Computer Science 2014-07-15 Joris van der Hoeven , Grégoire Lecerf , Guillaume Quintin

General Matrix-Matrix Multiplication Using SIMD features of the PIII

Generalised matrix-matrix multiplication forms the kernel of many mathematical algorithms. A faster matrix-matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices…

Performance · Computer Science 2019-12-11 Douglas Aberdeen , Jonathan Baxter

Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation

This paper presents software implementations of batch computations, dealing with multi-precision integer operations. In this work, we use the Single Instruction Multiple Data (SIMD) AVX512 instruction set of the x86-64 processors, in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-25 Laurent-Stéphane Didier , Nadia Mrabet , Léa Glandus , Jean-Marc Robert

Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization

Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations.…

Mathematical Software · Computer Science 2026-05-08 Tomonori Kouya

SIMD$^2$: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless,…

Hardware Architecture · Computer Science 2022-09-02 Yunan Zhang , Po-An Tsai , Hung-Wei Tseng

Leveraging SIMD for Accelerating Large-number Arithmetic

Large-number arithmetic, widely used in scientific computing and cryptography, has seen limited adoption of single instruction, multiple data (SIMD) parallelism on modern CPUs due to the inherent dependencies in traditional algorithms. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-27 Subhrajit Das , Abhishek Bichhawat , Yuvraj Patel

Efficient Additions and Montgomery Reductions of Large Integers for SIMD

This paper presents efficient algorithms, designed to leverage SIMD for performing Montgomery reductions and additions on integers larger than 512 bits. The existing algorithms encounter inefficiencies when parallelized using SIMD due to…

Cryptography and Security · Computer Science 2023-09-01 Pengchang Ren , Reiji Suda , Vorapong Suppakitpaisarn

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

Elzar: Triple Modular Redundancy using Intel Advanced Vector Extensions (technical report)

Instruction-Level Redundancy (ILR) is a well-known approach to tolerate transient CPU faults. It replicates instructions in a program and inserts periodic checks to detect and correct CPU faults using majority voting, which essentially…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-25 Dmitrii Kuvaiskii , Oleksii Oleksenko , Pramod Bhatotia , Pascal Felber , Christof Fetzer

A flexible algorithm for calculating pair interactions on SIMD architectures

Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have…

Computational Physics · Physics 2015-06-16 Szilárd Páll , Berk Hess

Short reasons for long vectors in HPC CPUs: a study based on RISC-V

For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially-available SIMD units process up to 8 double-precision elements with one instruction.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-14 Pablo Vizcaino , Georgios Ieronymakis , Nikolaos Dimou , Vassilis Papaefstathiou , Jesus Labarta , Filippo Mantovani

SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable Accuracy

The ever-increasing quest for data-level parallelism and variable precision in ubiquitous multimedia and Deep Neural Network (DNN) applications has motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To alleviate…

Hardware Architecture · Computer Science 2020-11-03 Zahra Ebrahimi , Salim Ullah , Akash Kumar

Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-01-30 Johannes Hofmann , Jan Treibig , Georg Hager , Gerhard Wellein

Acceleration of multiple precision matrix multiplication based on multi-component floating-point arithmetic using AVX2

In this paper, we report the results obtained from the acceleration of multi-binary64-type multiple precision matrix multiplication with AVX2. We target double-double (DD), triple-double (TD), and quad-double (QD) precision arithmetic…

Numerical Analysis · Mathematics 2021-09-14 Tomonori Kouya

A Novel SIMD-Optimized Implementation for Fast and Memory-Efficient Trigonometric Computation

This paper proposes a novel set of trigonometric implementations which are 5x faster than the inbuilt C++ functions. The proposed implementation is also highly memory efficient requiring no precomputations of any kind. Benchmark comparisons…

Mathematical Software · Computer Science 2025-02-18 Nikhil Dev Goyal , Parth Arora

Improving a Parallel C++ Intel AVX-512 SIMD Linear Genetic Programming Interpreter

We extend recent 256 SSE vector work to 512 AVX giving a four fold speedup. We use MAGPIE (Machine Automated General Performance Improvement via Evolution of software) to speedup a C++ linear genetic programming interpreter. Local search is…

Neural and Evolutionary Computing · Computer Science 2025-12-11 William B. Langdon