Related papers: Modular GPU Programming with Typed Perspectives

PRISM: Dynamic Primitive-Based Forecasting for Large-Scale GPU Cluster Workloads

Accurately forecasting GPU workloads is essential for AI infrastructure, enabling efficient scheduling, resource allocation, and power management. Modern workloads are highly volatile, exhibit multiple periodicities, and are heterogeneous, making them…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-27 Xin Wu , Fei Teng , Xingwang Li , Bin Zheng , Qiang Duan

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-14 Shan Yu , Jiarong Xing , Yifan Qiao , Mingyuan Ma , Yangmin Li , Yang Wang , Shuo Yang , Zhiqiang Xie , Shiyi Cao , Ke Bao , Ion Stoica , Harry Xu , Ying Sheng

A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphics processing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not yet take advantage of GPU parallelism. One reason…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-26 Pierre Talbot , Frédéric Pinel , Pascal Bouvry

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with…

Hardware Architecture · Computer Science 2020-08-11 Saiful A. Mojumder , Yifan Sun , Leila Delshadtehrani , Yenai Ma , Trinayan Baruah , José L. Abellán , John Kim , David Kaeli , Ajay Joshi

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key…

Machine Learning · Computer Science 2025-09-03 Yaoyao Ding , Bohan Hou , Xiao Zhang , Allan Lin , Tianqi Chen , Cody Yu Hao , Yida Wang , Gennady Pekhimenko

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Alicia Golden , Michael Kuchnik , Samuel Hsia , Zachary DeVito , Gu-Yeon Wei , David Brooks , Carole-Jean Wu

Task-Based Tensor Computations on Modern GPUs

Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change.…

Programming Languages · Computer Science 2025-04-10 Rohan Yadav , Michael Garland , Alex Aiken , Michael Bauer

Descend: A Safe GPU Systems Programming Language

Graphics Processing Units (GPUs) offer tremendous computational power by following a throughput-oriented computing paradigm in which many thousands of computational units operate in parallel. Programming this massively parallel hardware is…

Programming Languages · Computer Science 2023-05-08 Bastian Köpcke , Sergei Gorlatch , Michel Steuwer

Term Rewriting on GPUs

We present a way to implement term rewriting on a GPU. We do this by letting the GPU repeatedly perform a massively parallel evaluation of all subterms. We find that if the term rewrite systems exhibit sufficient internal parallelism, GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-16 Johri van Eerd , Jan Friso Groote , Pieter Hijma , Jan Martens , Anton Wijs

Enabling predictable parallelism in single-GPU systems with persistent CUDA threads

Graphics Processing Units, or GPUs, have been successfully adopted both for graphics computation in 3D applications and for general-purpose applications (GP-GPUs), thanks to their tremendous performance-per-watt. Recently, there is a big…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 Paolo Burgio

Optimizing Bloom Filters for Modern GPU Architectures

Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Daniel Jünger , Kevin Kristensen , Yunsong Wang , Xiangyao Yu , Bertil Schmidt

GPU accelerated program synthesis: Enumerate semantics, not syntax!

Program synthesis is an umbrella term for generating programs and logical formulae from specifications. With the remarkable performance improvements that GPUs enable for deep learning, a natural question arose: can we also implement a…

Programming Languages · Computer Science 2025-04-29 Martin Berger , Nathanaël Fijalkow , Mojtaba Valizadeh

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Anand Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , Matei Zaharia

Contract-Based General-Purpose GPU Programming

Using GPUs as general-purpose processors has revolutionized parallel computing by offering, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to widespread adoption, however, is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-14 Alexey Kolesnichenko , Christopher M. Poskitt , Sebastian Nanz , Bertrand Meyer

GPUVM: GPU-driven Unified Virtual Memory

Graphics Processing Units (GPUs) leverage massive parallelism and large memory bandwidth to support high-performance computing applications, such as multimedia rendering, crypto-mining, deep learning, and natural language processing. These…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Nurlan Nazaraliyev , Elaheh Sadredini , Nael Abu-Ghazaleh

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces…

Computation and Language · Computer Science 2025-03-27 Dewei Wang , Wei Zhu , Liyang Ling , Ettore Tiotto , Quintin Wang , Whitney Tsang , Julian Opperman , Jacky Deng

Systolic Computing on GPUs for Productive Performance

We propose a language and compiler to productively build high-performance software systolic arrays that run on GPUs. Based on a rigorous mathematical foundation (uniform recurrence equations and space-time transform), our language has…

Programming Languages · Computer Science 2020-11-02 Hongbo Rong , Xiaochen Hao , Yun Liang , Lidong Xu , Hong H Jiang , Pradeep Dubey

GPU-based parallelism for ASP-solving

Answer Set Programming (ASP) has become the paradigm of choice in the field of logic programming and non-monotonic reasoning. Thanks to the availability of efficient solvers, ASP has been successfully employed in a large number of…

Artificial Intelligence · Computer Science 2019-09-05 Agostino Dovier , Andrea Formisano , Flavio Vella

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu , Andrea Lodi