Related papers: Optimized Speculative Sampling for GPU Hardware Ac…

200 papers

We examine the problem of optimizing classification tree evaluation for on-line and real-time applications by using GPUs. Looking at trees with continuous attributes often used in image segmentation, we first put the existing algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-08 Jason Spencer

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the…

Computation and Language · Computer Science 2017-06-20 Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. Leveraging speculative weight updates when error gradients fall within a specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Sed Centeno, Christopher Sprague, Arnab A Purkayastha, Ray Simar, Neeraj Magotra

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri, Apan Qasem, Martin Burtscher

We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance over hardware architecture. We evaluate results on CPUs and GPUs…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-14 Brenton Lessley, Talita Perciano, Colleen Heinemann, David Camp, Hank Childs, E. Wes Bethel

Sampling is an important process in many GNN structures in order to train larger datasets with a smaller computational complexity. However, compared to other processes in GNN (such as aggregate, backward propagation), the sampling process…

Machine Learning · Computer Science 2022-09-08 Yuchen Gui, Boyi Wei, Wei Yuan, Xi Jin

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the…

Performance · Computer Science 2022-10-25 Xiaoyan Liu, Yi Liu, Ming Dun, Bohong Yin, Hailong Yang, Zhongzhi Luan, Depei Qian

We provide a preliminary study on utilizing GPU (Graphics Processing Unit) to accelerate computation for three simulation optimization tasks with either first-order or second-order algorithms. Compared to the implementation using only CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-19 Jinghai He, Haoyu Liu, Yuhang Wu, Zeyu Zheng, Tingyu Zhu

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Rodrigo Huerta, Antonio González

Stochastic simulation techniques employed for the analysis of portfolios of insurance/reinsurance risk, often referred to as 'Aggregate Risk Analysis', can benefit from exploiting state-of-the-art high-performance computing platforms. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-08-19 A. K. Bahl, O. Baltzer, A. Rau-Chaplin, B. Varghese, A. Whiteway

We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-15 Minseok Ryu, Geunyeong Byeon, Kibaek Kim

Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of…

Multiagent Systems · Computer Science 2018-07-04 Jiajian Xiao, Philipp Andelfinger, David Eckhoff, Wentong Cai, Alois Knoll

We consider deployment of the particle filter on modern massively parallel hardware architectures, such as Graphics Processing Units (GPUs), with a focus on the resampling stage. While standard multinomial and stratified resamplers require…

Computation · Statistics 2012-02-29 Lawrence Murray

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of…

Computation and Language · Computer Science 2023-02-03 Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-02 Bohan Zhao, Zane Cao, Yongchao He

The optimal allocation of replicas to a homogeneous or heterogeneous set of processors is derived for parallel tempering simulations on multi-processor machines. In the general case, it is possible without substantially increasing wall clock…

Computational Physics · Physics 2007-05-23 David J. Earl, Michael W. Deem

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications…

Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-04 Rory Mitchell, Daniel Stokes, Eibe Frank, Geoffrey Holmes

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou, Zhong Zhang, Maxime Robic, Fanghua Jiang

Nowadays, several industrial applications are being ported to parallel architectures. In fact, these platforms make it possible to achieve higher performance for system modelling and simulation. In the electric machines area, there are many problems which…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-10-25 Antonio Wendell De Oliveira Rodrigues, Frédéric Guyomarch, Yvonnick Le Menach, Jean-Luc Dekeyser