Related papers: Mirage Persistent Kernel: A Compiler and Runtime f…

Mirage: A Multi-Level Superoptimizer for Tensor Programs

We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is $\mu$Graphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.…

Machine Learning · Computer Science · 2025-06-09 · Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, Zhihao Jia

Stripe: Tensor Compilation via the Nested Polyhedral Model

Hardware architectures and machine learning (ML) libraries evolve rapidly. Traditional compilers often fail to generate high-performance code across the spectrum of new hardware offerings. To mitigate, engineers develop hand-tuned kernels…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-03-18 · Tim Zerrell, Jeremy Bruestle

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

OpenCL for FPGA enables developers to design FPGAs using a programming model similar to that for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-06 · Ji Liu, Abdullah-Al Kafi, Xipeng Shen, Huiyang Zhou

TPU-MLIR: A Compiler For TPU Using MLIR

Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end…

Programming Languages · Computer Science · 2023-02-10 · Pengchao Hu, Man Lu, Lei Wang, Guoyue Jiang

A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-15 · Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi

Kernelized Multiview Projection

Conventional vision algorithms adopt a single type of feature or a simple concatenation of multiple features, which is always represented in a high-dimensional space. In this paper, we propose a novel unsupervised spectral embedding…

Computer Vision and Pattern Recognition · Computer Science · 2015-08-05 · Mengyang Yu, Li Liu, Ling Shao

Supervised Multiple Kernel Learning approaches for multi-omics data integration

Advances in high-throughput technologies have originated an ever-increasing availability of omics datasets. The integration of multiple heterogeneous data sources is currently an issue for biology and bioinformatics. Multiple kernel…

Machine Learning · Statistics · 2024-12-04 · Mitja Briscik, Gabriele Tazza, Marie-Agnes Dillies, László Vidács, Sébastien Dejean

PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-15 · Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka

A User's Guide to $\texttt{KSig}$: GPU-Accelerated Computation of the Signature Kernel

The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and recently…

Machine Learning · Statistics · 2025-01-15 · Csaba Tóth, Danilo Jr Dela Cruz, Harald Oberhauser

MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors

Heterogeneous parallel error detection is an approach to achieving fault-tolerant processors, leveraging multiple power-efficient cores to re-execute software originally run on a high-performance core. Yet, its complex components, gathering…

Hardware Architecture · Computer Science · 2025-04-03 · Zhe Jiang, Minli Liao, Sam Ainsworth, Dean You, Timothy Jones

Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-03 · Denis Los, Igor Petushkov

CMLCompiler: A Unified Compiler for Classical Machine Learning

Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the…

Machine Learning · Computer Science · 2023-05-01 · Xu Wen, Wanling Gao, Anzheng Li, Lei Wang, Zihan Jiang, Jianfeng Zhan

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands…

Computation and Language · Computer Science · 2026-05-13 · Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu

Multiplierless MP-Kernel Machine For Energy-efficient Edge Devices

We present a novel framework for designing multiplierless kernel machines that can be used on resource-constrained platforms like intelligent edge devices. The framework uses a piecewise linear (PWL) approximation based on a margin…

Machine Learning · Computer Science · 2022-09-12 · Abhishek Ramdas Nair, Pallab Kumar Nath, Shantanu Chakrabartty, Chetan Singh Thakur

Twinkle: A GPU-based binary-lens microlensing code with contour integration method

With the rapidly increasing rate of microlensing planet detections, microlensing modeling software faces significant challenges in computation efficiency. Here, we develop the Twinkle code, an efficient and robust binary-lens modeling…

Instrumentation and Methods for Astrophysics · Physics · 2025-03-18 · Suwei Wang, Lile Wang, Subo Dong

Building a Reusable and Extensible Automatic Compiler Infrastructure for Reconfigurable Devices

Multi-Level Intermediate Representation (MLIR) is gaining increasing attention in reconfigurable hardware communities due to its capability to represent various abstract levels for software compilers. This project aims to be the first to…

Hardware Architecture · Computer Science · 2024-01-22 · Zhenya Zang, Uwe Dolinsky, Pietro Ghiglio, Stefano Cherubin, Mehdi Goli, Shufan Yang

MING: An Automated CNN-to-Edge MLIR HLS framework

Driven by the increasing demand for low-latency and real-time processing, machine learning applications are steadily migrating toward edge computing platforms, where Field-Programmable Gate Arrays (FPGAs) are widely adopted for their energy…

Hardware Architecture · Computer Science · 2026-02-13 · Jiahong Bi, Lars Schütze, Jeronimo Castrillon

Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation

Modern computing workloads commonly involve matrix-matrix multiplication (mmul) as a core computing pattern. Coarse-Grained Reconfigurable Arrays (CGRAs) can flexibly and efficiently support it, since they combine operation-level…

Hardware Architecture · Computer Science · 2026-04-29 · Yuxuan Wang, María José Belda, Fernando Castro, Katzalin Olcoz, David Atienza, Giovanni Ansaloni

Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels

The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise-experts…

Machine Learning · Computer Science · 2025-11-19 · Arun Thangamani, Md Asghar Ahmad Shahid, Adam Siemieniuk, Rolf Morel, Renato Golin, Alexander Heinecke

Cortex: A Compiler for Recursive Deep Learning Models

Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low level kernel optimizations such as those found in vendor libraries. This approach often leaves…

Machine Learning · Computer Science · 2021-03-08 · Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry