Related papers: From Sequential to Parallel: Reformulating Dynamic…

GPU-based Split algorithm for Large-Scale CVRPSD

Dynamic programming (DP) is a cornerstone of combinatorial optimization, yet its inherently sequential structure has long limited its scalability in scenario-based stochastic programming (SP). This paper introduces a GPU-accelerated…

Optimization and Control · Mathematics 2025-11-25 Jingyi Zhao , Linxin Yang , Haohua Zhang , Tian Ding

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis , Kamesh Arumugam , Marc Paterno , Desh Ranjan , Balša Terzić , Mohammad Zubair

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou , Zhong Zhang , Maxime Robic , Fanghua Jiang

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu , Andrea Lodi

CPU- and GPU-Based Parallelization of the Robust Reference Governor

Constraint management is a central challenge in modern control systems. A solution is the Reference Governor (RG), which is an add-on strategy to pre-stabilized feedback control systems to enforce state and input constraints by shaping the…

Systems and Control · Electrical Eng. & Systems 2025-10-10 Hamid R. Ossareh , William Shayne , Samuel Chevalier

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads

Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to execute additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-07 Felipe A. Quezada , Cristóbal A. Navarro , Miguel Romero , Cristhian Aguilera

Towards a Linear-Algebraic Hypervisor

Many techniques in program synthesis, superoptimization, and array programming require parallel rollouts of general-purpose programs. GPUs, while capable targets for domain-specific parallelism, are traditionally underutilized by such…

Programming Languages · Computer Science 2026-04-15 Breandan Considine

Safe Large-Scale Robust Nonlinear MPC in Milliseconds via Reachability-Constrained System Level Synthesis on the GPU

We present GPU-SLS, a GPU-parallelized framework for safe, robust nonlinear model predictive control (MPC) that scales to high-dimensional uncertain robotic systems and long planning horizons. Our method jointly optimizes an…

Robotics · Computer Science 2026-04-10 Jeffrey Fang , Glen Chou

An Easy-to-use Scalable Framework for Parallel Recursive Backtracking

Supercomputers are equipped with an increasingly large number of cores to use computational power as a way of solving problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-12-31 Faisal N. Abu-Khzam , Khuzaima Daudjee , Amer E. Mouawad , Naomi Nishimura

Kernel machines that adapt to GPUs for effective large batch training

Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing…

Machine Learning · Statistics 2019-03-05 Siyuan Ma , Mikhail Belkin

Differentiable Particle Optimization for Fast Sequential Manipulation

Sequential robot manipulation tasks require finding collision-free trajectories that satisfy geometric constraints across multiple object interactions in potentially high-dimensional configuration spaces. Solving these problems in real-time…

Robotics · Computer Science 2025-10-14 Lucas Chen , Shrutheesh Raman Iyer , Zachary Kingston

GPU-Accelerated SPOCK for Scenario-Based Risk-Averse Optimal Control Problems

This paper presents a GPU-accelerated implementation of the SPOCK algorithm, a proximal method designed for solving scenario-based risk-averse optimal control problems. The proposed implementation leverages the massive parallelization of…

Optimization and Control · Mathematics 2025-05-20 Ruairi Moran , Pantelis Sopasakis

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-22 Ankur Lahiry , Ayush Pokharel , Banooqa Banday , Seth Ockerman , Amal Gueroudji , Mohammad Zaeed , Tanzima Z. Islam , Line Pouchard

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Sparse compiler is a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Though GPU provides various reduction semantics that can…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Genghan Zhang , Yuetong Zhao , Yanting Tao , Zhongming Yu , Guohao Dai , Sitao Huang , Yuan Wen , Pavlos Petoumenos , Yu Wang

SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Jin Lee , Zhonghao Chen , Xuhang He , Robert Underwood , Bogdan Nicolae , Franck Cappello , Xiaoyi Lu , Sheng Di , Zheng Zhang

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…

Performance · Computer Science 2020-06-22 James D. Stevens , Andreas Klöckner

SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction

With the growing model size, deep neural networks (DNN) are increasingly trained over massive GPU accelerators, which demands a proper parallelization plan that transforms a DNN model into fine-grained tasks and then schedules them to GPUs…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-24 Zhiqi Lin , Youshan Miao , Guodong Liu , Xiaoxiang Shi , Quanlu Zhang , Fan Yang , Saeed Maleki , Yi Zhu , Xu Cao , Cheng Li , Mao Yang , Lintao Zhang , Lidong Zhou

Multistep schemes for solving backward stochastic differential equations on GPU

The goal of this work is to parallelize the multistep scheme for the numerical approximation of the backward stochastic differential equations (BSDEs) in order to achieve both, a high accuracy and a reduction of the computation time as…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-18 Lorenc Kapllani , Long Teng

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-09 Peng Chen , Mohamed Wahib , Shinichiro Takizawa , Ryousei Takano , Satoshi Matsuoka

A Fast and Generic GPU-Based Parallel Reduction Implementation

Reduction operations are extensively employed in many computational problems. A reduction consists of, given a finite set of numeric elements, combining into a single value all elements in that set, using for this a combiner function. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-23 Walid Jradi , Hugo do Nascimento , Wellington Martins