Related papers: Optimized Speculative Sampling for GPU Hardware Ac…

Speculative Parallel Evaluation Of Classification Trees On GPGPU Compute Engines

We examine the problem of optimizing classification tree evaluation for on-line and real-time applications by using GPUs. Looking at trees with continuous attributes often used in image segmentation, we first put the existing algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-08 Jason Spencer

Efficient softmax approximation for GPUs

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the…

Computation and Language · Computer Science 2017-06-20 Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

Exploring Parallelism in FPGA-Based Accelerators for Machine Learning Applications

Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. Leveraging speculative weight updates when error gradients fall within a specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Sed Centeno, Christopher Sprague, Arnab A Purkayastha, Ray Simar, Neeraj Magotra

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri, Apan Qasem, Martin Burtscher

DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance over hardware architecture. We evaluate results on CPUs and GPUs…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-14 Brenton Lessley, Talita Perciano, Colleen Heinemann, David Camp, Hank Childs, E. Wes Bethel

Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Sampling is an important process in many GNN structures, allowing training on larger datasets with lower computational complexity. However, compared to other processes in GNNs (such as aggregation and backward propagation), the sampling process…

Machine Learning · Computer Science 2022-09-08 Yuchen Gui, Boyi Wei, Wei Yuan, Xi Jin

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the…

Performance · Computer Science 2022-10-25 Xiaoyan Liu, Yi Liu, Ming Dun, Bohong Yin, Hailong Yang, Zhongzhi Luan, Depei Qian

A Preliminary Study on Accelerating Simulation Optimization with GPU Implementation

We provide a preliminary study on using a GPU (Graphics Processing Unit) to accelerate computation for three simulation optimization tasks with either first-order or second-order algorithms. Compared to the implementation using only CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-19 Jinghai He, Haoyu Liu, Yuhang Wu, Zeyu Zheng, Tingyu Zhu

Parallelizing a modern GPU simulator

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Rodrigo Huerta, Antonio González

Achieving Speedup in Aggregate Risk Analysis using Multiple GPUs

Stochastic simulation techniques employed for the analysis of portfolios of insurance/reinsurance risk, often referred to as 'Aggregate Risk Analysis', can benefit from exploiting state-of-the-art high-performance computing platforms. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-08-19 A. K. Bahl, O. Baltzer, A. Rau-Chaplin, B. Varghese, A. Whiteway

A GPU-Accelerated Distributed Algorithm for Optimal Power Flow in Distribution Systems

We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-15 Minseok Ryu, Geunyeong Byeon, Kibaek Kim

A Survey on Agent-based Simulation using Hardware Accelerators

Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of…

Multiagent Systems · Computer Science 2018-07-04 Jiajian Xiao, Philipp Andelfinger, David Eckhoff, Wentong Cai, Alois Knoll

GPU acceleration of the particle filter: the Metropolis resampler

We consider deployment of the particle filter on modern massively parallel hardware architectures, such as Graphics Processing Units (GPUs), with a focus on the resampling stage. While standard multinomial and stratified resamplers require…

Computation · Statistics 2012-02-29 Lawrence Murray

Accelerating Large Language Model Decoding with Speculative Sampling

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of…

Computation and Language · Computer Science 2023-02-03 Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-02 Bohan Zhao, Zane Cao, Yongchao He

Optimal Allocation of Replicas to Processors in Parallel Tempering Simulations

The optimal allocation of replicas to a homogeneous or heterogeneous set of processors is derived for parallel tempering simulations on multi-processor machines. In the general case, it is possible without substantially increasing wall clock…

Computational Physics · Physics 2007-05-23 David J. Earl, Michael W. Deem

BASS: Batched Attention-optimized Speculative Sampling

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications…

Machine Learning · Computer Science 2024-06-27 Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

Bandwidth-Optimal Random Shuffling for GPUs

Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-04 Rory Mitchell, Daniel Stokes, Eibe Frank, Geoffrey Holmes

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou, Zhong Zhang, Maxime Robic, Fanghua Jiang

Parallel Sparse Matrix Solver on the GPU Applied to Simulation of Electrical Machines

Nowadays, several industrial applications are being ported to parallel architectures. These platforms make it possible to achieve higher performance for system modelling and simulation. In the electric machines area, there are many problems which…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-10-25 Antonio Wendell De Oliveira Rodrigues, Frédéric Guyomarch, Yvonnick Le Menach, Jean-Luc Dekeyser