Related papers: A Case Study of LLVM-Based Analysis for Optimizing…

SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable Accuracy

The ever-increasing quest for data-level parallelism and variable precision in ubiquitous multimedia and Deep Neural Network (DNN) applications has motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To alleviate…

Hardware Architecture · Computer Science 2020-11-03 Zahra Ebrahimi , Salim Ullah , Akash Kumar

Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Matyáš Brabec , Jiří Klepl , Michal Töpfer , Martin Kruliš

EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models

The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, it is still a challenge to deploy LLMs on resource-constrained…

Hardware Architecture · Computer Science 2025-03-03 Mingqiang Huang , Ao Shen , Kai Li , Haoxiang Peng , Boyu Li , Yupeng Su , Hao Yu

Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions

This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count…

Hardware Architecture · Computer Science 2021-06-15 Philippos Papaphilippou , Paul H. J. Kelly , Wayne Luk

SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of…

Computation · Statistics 2015-03-02 Alireza S. Mahani , Mansour T. A. Sharabiani

SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation

SIMD (Single Instruction Multiple Data) instructions and their compiler intrinsics are widely supported by modern processors to accelerate performance-critical tasks. SIMD intrinsic programming, a trade-off between coding productivity and…

Software Engineering · Computer Science 2025-07-22 Yibo He , Shuoran Zhao , Jiaming Huang , Yingjie Fu , Hao Yu , Cunjian Huang , Tao Xie

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen , Jiace Zhu , Qi Fan , Yehan Ma , An Zou

Towards Autotuning of OpenMP Applications on Multicore Architectures

In this paper we describe an autotuning tool for optimization of OpenMP applications on highly multicore and multithreaded architectures. Our work was motivated by in-depth performance analysis of scientific applications and synthetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-01-17 Jakub Katarzyński , Maciej Cytowski

AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design

Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on…

Hardware Architecture · Computer Science 2025-05-08 Yanbiao Liang , Huihong Shi , Haikuo Shao , Zhongfeng Wang

AI Powered Compiler Techniques for DL Code Optimization

Creating high performance implementations of deep learning primitives on CPUs is a challenging task. Multiple considerations including multi-level cache hierarchy, and wide SIMD units of CPU platforms influence the choice of program…

Programming Languages · Computer Science 2021-04-13 Sanket Tavarageri , Gagandeep Goyal , Sasikanth Avancha , Bharat Kaul , Ramakrishna Upadrasta

CIDPro: Custom Instructions for Dynamic Program Diversification

Timing side-channel attacks pose a major threat to embedded systems due to their ease of accessibility. We propose CIDPro, a framework that relies on dynamic program diversification to mitigate timing side-channel leakage. The proposed…

Cryptography and Security · Computer Science 2018-09-06 Thinh Hung Pham , Alexander Fell , Arnab Kumar Biswas , Siew-Kei Lam , Nandeesha Veeranna

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Jiaqi Lv , Xufeng He , Yanchen Liu , Xu Dai , Aocheng Shen , Yinghao Li , Jiachen Hao , Jianrong Ding , Yang Hu , Shouyi Yin

Llama-based source code vulnerability detection: Prompt engineering vs Fine tuning

The significant increase in software production, driven by the acceleration of development cycles over the past two decades, has led to a steady rise in software vulnerabilities, as shown by statistics published yearly by the CVE program.…

Software Engineering · Computer Science 2025-12-11 Dyna Soumhane Ouchebara , Stéphane Dupont

Learning Performance-Improving Code Edits

With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the…

Software Engineering · Computer Science 2024-04-29 Alexander Shypula , Aman Madaan , Yimeng Zeng , Uri Alon , Jacob Gardner , Milad Hashemi , Graham Neubig , Parthasarathy Ranganathan , Osbert Bastani , Amir Yazdanbakhsh

DMax: Aggressive Parallel Decoding for dLLMs

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked…

Machine Learning · Computer Science 2026-05-18 Zigeng Chen , Gongfan Fang , Xinyin Ma , Ruonan Yu , Xinchao Wang

A Precision Emulation Approach to the GPU Acceleration of Ab Initio Electronic Structure Calculations

This study explores the use of INT8-based emulation for accelerating traditional FP64-based HPC workloads on modern GPU architectures. Through SCILIB-Accel automatic BLAS offload tool for cache-coherent Unified Memory Architecture, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Hang Liu , Junjie Li , Yinzhi Wang , Niraj K. Nepal , Yang Wang

LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation

Large language models (LLMs) are a new and powerful tool for a wide span of applications involving natural language and demonstrate impressive code generation abilities. The goal of this work is to automatically generate tests and use these…

Artificial Intelligence · Computer Science 2024-03-12 Christian Munley , Aaron Jarmusch , Sunita Chandrasekaran

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-01 Zihan Liu , Xinhao Luo , Junxian Guo , Wentao Ni , Yangjie Zhou , Yue Guan , Cong Guo , Weihao Cui , Yu Feng , Minyi Guo , Yuhao Zhu , Minjia Zhang , Jingwen Leng , Chen Jin

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

Multimodal Large Language Models (MLLMs) have achieved remarkable advances by integrating text, image, and audio understanding within a unified architecture. However, existing distributed training frameworks remain fundamentally data-blind:…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-20 Hyeonjun An , Sihyun Kim , Chaerim Lim , Hyunjoon Kim , Rathijit Sen , Sangmin Jung , Hyeonsoo Lee , Dongwook Kim , Takki Yu , Jinkyu Jeong , Youngsok Kim , Kwanghyun Park

HLSPilot: LLM-based High-Level Synthesis

Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation with natural language, it…

Hardware Architecture · Computer Science 2024-08-14 Chenwei Xiong , Cheng Liu , Huawei Li , Xiaowei Li