Related papers: High-Performance Code Generation though Fusion and…

Autovesk: Automatic vectorized code generation from unstructured static kernels using graph transformations

Leveraging the SIMD capability of modern CPU architectures is mandatory to take full benefit of their increasing performance. To exploit this feature, binary executables must be explicitly vectorized by the developers or an automatic…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-03 Hayfa Tayeb, Ludovic Paillat, Berenger Bramas

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-09 Oscar Amoros, Albert Andaluz, Johnny Nunez, Antonio J. Pena

Towards High-Performance and Portable Molecular Docking on CPUs through Vectorization

Recent trends in the HPC field have introduced new CPU architectures with improved vectorization capabilities that require optimization to achieve peak performance and thus pose challenges for performance portability. The deployment of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-17 Gianmarco Accordi, Jens Domke, Theresa Pollinger, Davide Gadioli, Gianluca Palermo

Model-Based Performance Analysis of the HyTeG Finite Element Framework

In this work, we present how code generation techniques significantly improve the performance of the computational kernels in the HyTeG software framework. This HPC framework combines the performance and memory advantages of matrix-free…

Performance · Computer Science 2023-05-25 Dominik Thönnes, Ulrich Rüde

Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

Retrieval-augmented code generation often conditions the decoder on large retrieved code snippets. This ties online inference cost to repository size and introduces noise from long contexts. We present Hierarchical Embedding Fusion (HEF), a…

Computation and Language · Computer Science 2026-03-10 Nikita Sorokin, Ivan Sedykh, Valentin Malykh

Optimizing CUDA Code By Kernel Fusion: Application on BLAS

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

Exploiting long vectors with a CFD code: a co-design show case

A current trend in HPC systems is the utilization of architectures with SIMD or vector extensions to exploit data parallelism. There are several ways to take advantage of such modern vector architectures, each with a different impact on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-05 Marc Blancafort, Roger Ferrer, Guillaume Houzeaux, Marta Garcia-Gasulla, Filippo Mantovani

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-01 Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin

Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods

In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations…

Computational Engineering, Finance, and Science · Computer Science 2016-11-03 Diego Fabregat-Traver, Davor Davidović, Markus Höhnerbach, Edoardo Di Napoli

Towards An Approach to Identify Divergences in Hardware Designs for HPC Workloads

Developing efficient hardware accelerators for mathematical kernels used in scientific applications and machine learning has traditionally been a labor-intensive task. These accelerators typically require low-level programming in Verilog or…

Hardware Architecture · Computer Science 2025-09-15 Doru Thom Popovici, Mario Vega, Angelos Ioannou, Fabien Chaix, Dania Mosuli, Blair Reasoner, Tan Nguyen, Xiaokun Yang, John Shalf

Fast Access to Columnar, Hierarchically Nested Data via Code Transformation

Big Data query systems represent data in a columnar format for fast, selective access, and in some cases (e.g. Apache Drill), perform calculations directly on the columnar data without row materialization, avoiding runtime costs. However,…

Programming Languages · Computer Science 2017-11-06 Jim Pivarski, Peter Elmer, Brian Bockelman, Zhe Zhang

High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods

One of the greatest efforts of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical…

Computational Engineering, Finance, and Science · Computer Science 2018-01-17 Edoardo Di Napoli, Elmar Peise, Markus Hrywniak, Paolo Bientinesi

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

During the past decade, Deep Learning (DL) algorithms, programming systems and hardware have converged with the High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-19 Evangelos Georganas, Dhiraj Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke

A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over…

Mathematical Software · Computer Science 2020-08-26 Tianjiao Sun, Lawrence Mitchell, Kaushik Kulkarni, Andreas Klöckner, David A. Ham, Paul H. J. Kelly

Transformations of High-Level Synthesis Codes for High-Performance Computing

Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-24 Johannes de Fine Licht, Maciej Besta, Simon Meierhans, Torsten Hoefler

FFConv: Fast Factorized Convolutional Neural Network Inference on Encrypted Data

Homomorphic Encryption (HE), allowing computations on encrypted data (ciphertext) without decrypting it first, enables secure but prohibitively slow Convolutional Neural Network (CNN) inference for privacy-preserving applications in clouds.…

Cryptography and Security · Computer Science 2022-06-23 Yuxiao Lu, Jie Lin, Chao Jin, Zhe Wang, Min Wu, Khin Mi Mi Aung, Xiaoli Li

HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs

We present a novel deep learning architecture in which the convolution operation leverages heterogeneous kernels. The proposed HetConv (Heterogeneous Kernel-Based Convolution) reduces the computation (FLOPs) and the number of parameters as…

Computer Vision and Pattern Recognition · Computer Science 2019-03-26 Pravendra Singh, Vinay Kumar Verma, Piyush Rai, Vinay P. Namboodiri

Learning Random Fourier Features by Hybrid Constrained Optimization

The kernel embedding algorithm is an important component for adapting kernel methods to large datasets. Since the algorithm consumes a major computation cost in the testing phase, we propose a novel teacher-learner framework of learning…

Machine Learning · Statistics 2017-12-08 Jianqiao Wangni, Jingwei Zhuo, Jun Zhu

A Kernel Search Algorithm for Virtual Machine Consolidation Problem

Virtual machine consolidation describes the process of reallocation of virtual machines (VMs) on a set of target servers. It can be formulated as a mixed integer linear programming problem which is proven to be an NP-hard problem. In this…

Optimization and Control · Mathematics 2022-12-29 Jiang-Yao Luo, Jian-Hua Yuan

High-performance Kernel Machines with Implicit Distributed Optimization and Randomization

In order to fully utilize "big data", it is often required to use "big models". Such models tend to grow with the complexity and size of the training data, and do not make strong parametric assumptions upfront on the nature of the…

Machine Learning · Statistics 2015-04-17 Vikas Sindhwani, Haim Avron