Related papers: Autovesk: Automatic vectorized code generation fro…

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as…

Computation and Language · Computer Science 2026-05-19 Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

High-Performance Code Generation though Fusion and Vectorization

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…

Performance · Computer Science 2017-10-25 Jason Sewall, Simon J. Pennycook

Automatic Code Generation for High-Performance Discontinuous Galerkin Methods on Modern Architectures

SIMD vectorization has lately become a key challenge in high-performance computing. However, hand-written explicitly vectorized code often poses a threat to the software's sustainability. In this publication we solve this sustainability and…

Numerical Analysis · Mathematics 2018-12-20 Dominic Kempf, René Heß, Steffen Müthing, Peter Bastian

Retrofitting Control Flow Graphs in LLVM IR for Auto Vectorization

Modern processors increasingly rely on SIMD instruction sets, such as AVX and RVV, to significantly enhance parallelism and computational performance. However, production-ready compilers like LLVM and GCC often fail to fully exploit…

Programming Languages · Computer Science 2025-10-07 Shihan Fang, Wenxin Zheng

Revec: Program Rejuvenation through Revectorization

Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD) or vector instructions which expose data level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in…

Programming Languages · Computer Science 2019-02-11 Charith Mendis, Ajay Jain, Paras Jain, Saman Amarasinghe

Exploiting long vectors with a CFD code: a co-design show case

A current trend in HPC systems is the utilization of architectures with SIMD or vector extensions to exploit data parallelism. There are several ways to take advantage of such modern vector architectures, each with a different impact on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-05 Marc Blancafort, Roger Ferrer, Guillaume Houzeaux, Marta Garcia-Gasulla, Filippo Mantovani

A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these. The vectorization units (SIMD) are among the few CPUs' parts that can and must be…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-22 Bérenger Bramas

LLM-Vectorizer: LLM-based Verified Loop Vectorizer

Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently…

Software Engineering · Computer Science 2024-06-10 Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, Shuvendu K. Lahiri

A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over…

Mathematical Software · Computer Science 2020-08-26 Tianjiao Sun, Lawrence Mitchell, Kaushik Kulkarni, Andreas Klöckner, David A. Ham, Paul H. J. Kelly

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Performance of SSE and AVX Instruction Sets

SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured by Intel and AMD. This SIMD programming allows parallel…

High Energy Physics - Lattice · Physics 2013-11-05 Hwancheol Jeong, Sunghoon Kim, Weonjong Lee, Seok-Ho Myung

Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Scalable vector instruction sets such as Arm SVE enable vector-length-agnostic (VLA) execution, allowing a single implementation to adapt across hardware with different vector lengths. However, they complicate compiler code generation, as…

Performance · Computer Science 2026-05-19 Ege Beysel, Maximilian Bartel, Jan Moritz Joseph

Im2Vec: Synthesizing Vector Graphics without Vector Supervision

Vector graphics are widely used to represent fonts, logos, digital artworks, and graphic designs. But, while a vast body of work has focused on generative algorithms for raster images, only a handful of options exists for vector graphics.…

Computer Vision and Pattern Recognition · Computer Science 2021-04-02 Pradyumna Reddy, Michael Gharbi, Michal Lukac, Niloy J. Mitra

Massively Parallel Graph Drawing and Representation Learning

To fully exploit the performance potential of modern multi-core processors, machine learning and data mining algorithms for big data must be parallelized in multiple ways. Today's CPUs consist of multiple cores, each following an…

Machine Learning · Computer Science 2020-11-09 Christian Böhm, Claudia Plant

An Efficient Vectorization Scheme for Stencil Computation

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-19 Kun Li, Liang Yuan, Yunquan Zhang, Yue Yue, Hang Cao, Pengqi Lu

The ARM Scalable Vector Extension

This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better…

Hardware Architecture · Computer Science 2018-03-19 Nigel Stephens, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, Grigorios Magklis, Alejandro Martinez, Nathanael Premillieu, Alastair Reid, Alejandro Rico, Paul Walker

Automatic Generation of Vectorized Montgomery Algorithm

Modular arithmetic is widely used in cryptography and symbolic computation. This paper presents a vectorized Montgomery algorithm for modular multiplication, the key to fast modular arithmetic, that fully utilizes the SIMD instructions. We…

Mathematical Software · Computer Science 2016-09-06 Lingchuan Meng

Vectorization of Verilog Designs and its Effects on Verification and Synthesis

Vectorization is a compiler optimization that replaces multiple operations on scalar values with a single operation on vector values. Although common in traditional compilers such as rustc, clang, and gcc, vectorization is not common in the…

Programming Languages · Computer Science 2026-05-15 Maria Fernanda Oliveira Guimarães, Ulisses Rosa, Ian Trudel, João Victor Amorim Vieira, Augusto Amaral Mafra, Mirlaine Crepalde, Fernando Magno Quintão Pereira

Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications

Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorization codes. However, such patterns in irregular applications are unknown until runtime due to the input dependence. Thus, either…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-26 Changxi Liu, Hailong Yang, Xu Liu, Zhongzhi Luan, Depei Qian

Sparsity-Specific Code Optimization using Expression Trees

We introduce a code generator that converts unoptimized C++ code operating on sparse data into vectorized and parallel CPU or GPU kernels. Our approach unrolls the computation into a massive expression graph, performs redundant expression…

Programming Languages · Computer Science 2022-03-15 Philipp Herholz, Xuan Tang, Teseo Schneider, Shoaib Kamil, Daniele Panozzo, Olga Sorkine-Hornung