Related papers: Array Program Transformation with Loo.py by Exampl…

Loo.py: From Fortran to performance via transformation and substitution rules

A large amount of numerically-oriented code is written and is being written in legacy languages. Much of this code could, in principle, make good use of data-parallel throughput-oriented computer architectures. Loo.py, a…

Programming Languages · Computer Science 2015-05-19 Andreas Klöckner

Loo.py: transformation-based code generation for GPUs and CPUs

Today's highly heterogeneous computing landscape places a burden on programmers wanting to achieve high performance on a reasonably broad cross-section of machines. To do so, computations need to be expressed in many different but…

Programming Languages · Computer Science 2014-06-02 Andreas Klöckner

ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming

Emerging GPU architectures for high performance computing are well suited to a data-parallel programming model. This paper presents preliminary work examining a programming methodology that provides Fortran programmers with access to these…

Programming Languages · Computer Science 2011-07-13 Matthew J. Sottile, Craig E. Rasmussen, Wayne N. Weseloh, Robert W. Robey, Daniel Quinlan, Jeffrey Overbey

Functional Logic Program Transformations

Many tools used to process programs, like compilers, analyzers, or verifiers, perform transformations on their intermediate program representation, like abstract syntax trees. Implementing such program transformations is a non-trivial task,…

Programming Languages · Computer Science 2026-01-21 Michael Hanus, Steven Libby

A Quick Introduction to Functional Verification of Array-Intensive Programs

Array-intensive programs are often amenable to parallelization across many cores on a single machine as well as scaling across multiple machines and hence are well explored, especially in the domain of high-performance computing. These…

Programming Languages · Computer Science 2019-05-23 Kunal Banerjee, Chandan Karfa

Program Transformation to Identify List-Based Parallel Skeletons

Algorithmic skeletons are used as building-blocks to ease the task of parallel programming by abstracting the details of parallel implementation from the developer. Most existing libraries provide implementations of skeletons that are…

Programming Languages · Computer Science 2016-07-11 Venkatesh Kannan, G. W. Hamilton

Parameter-Efficient Finetuning of Transformers for Source Code

Pretrained Transformers achieve state-of-the-art performance in various code-processing tasks but may be too large to be deployed. As software development tools often incorporate modules for various purposes which may potentially use a…

Computation and Language · Computer Science 2022-12-13 Shamil Ayupov, Nadezhda Chirkova

High-Performance Code Generation though Fusion and Vectorization

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…

Performance · Computer Science 2017-10-25 Jason Sewall, Simon J. Pennycook

Fast transforms over finite fields of characteristic two

An additive fast Fourier transform over a finite field of characteristic two efficiently evaluates polynomials at every element of an $\mathbb{F}_2$-linear subspace of the field. We view these transforms as performing a change of basis from…

Symbolic Computation · Computer Science 2018-07-23 Nicholas Coxon

Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code

The current trends in next-generation exascale systems go towards integrating a wide range of specialized (co-)processors into traditional supercomputers. Due to the efficiency of heterogeneous systems in terms of Watts and FLOPS per…

Programming Languages · Computer Science 2017-01-26 Guillermo Vigueras, Manuel Carro, Salvador Tamarit, Julio Mariño

Advanced Programming Platform for efficient use of Data Parallel Hardware

Graphics processing units (GPUs) have evolved from specialized hardware for rendering high-quality graphics in games into commodity hardware for effectively processing blocks of data in parallel. This evolution is particularly…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-03-26 Luis Cabellos

A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run…

Performance · Computer Science 2016-04-19 James Stevens, Andreas Klöckner

High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison

We present a versatile GPU-based parallel version of Logistic Regression (LR), aiming to address the increasing demand for faster algorithms in binary classification due to large data sets. Our implementation is a direct translation of the…

Machine Learning · Computer Science 2023-08-22 Nechba Mohammed, Mouhajir Mohamed, Sedjari Yassine

Functional design of efficient and parallelizable combinatorial generators using convolution

The application of program transformation and algebraic methods to the development of efficient combinatorial optimization (CO) algorithms relies on an exhaustive combinatorial generator for the problem specification, followed by the fusion…

Discrete Mathematics · Computer Science 2026-05-29 Xi He, Max A. Little

Transformations of High-Level Synthesis Codes for High-Performance Computing

Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-24 Johannes de Fine Licht, Maciej Besta, Simon Meierhans, Torsten Hoefler

Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware applications

We present a systematic, algebraically based, design methodology for efficient implementation of computer programs optimized over multiple levels of the processor/memory and network hierarchy. Using a common formalism to describe the…

Mathematical Software · Computer Science 2008-03-18 Lenore R. Mullin, James E. Raynolds

Simple, Parallel, High-Performance Virtual Machines for Extreme Computations

We introduce a high-performance virtual machine (VM) written in a numerically fast language like Fortran or C to evaluate very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present…

Computational Physics · Physics 2015-09-22 Bijan Chokoufe Nejad, Thorsten Ohl, Jürgen Reuter

Efficient GPU Implementation of Affine Index Permutations on Arrays

Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-18 Mathis Bouverot-Dupuis, Mary Sheeran

GT4Py: High Performance Stencils for Weather and Climate Applications using Python

All major weather and climate applications are currently developed using languages such as Fortran or C++. This is typical in the domain of high performance computing (HPC), where efficient execution is an important concern. Unfortunately,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-15 Enrique G. Paredes, Linus Groner, Stefano Ubbiali, Hannes Vogt, Alberto Madonna, Kean Mariotti, Felipe Cruz, Lucas Benedicic, Mauro Bianco, Joost VandeVondele, Thomas C. Schulthess

A Performance Vocabulary for Affine Loop Transformations

Modern polyhedral compilers excel at aggressively optimizing codes with static control parts, but the state-of-practice to find high-performance polyhedral transformations especially for different hardware targets still largely involves…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-10 Martin Kong, Louis-Noël Pouchet