Related papers: Optimizing Data Collection in Deep Reinforcement L…

Accelerated Methods for Deep Reinforcement Learning

Deep reinforcement learning (RL) has achieved many recent successes, yet experiment turn-around time remains a key bottleneck in research and in practice. We investigate how to optimize existing deep RL algorithms for modern computers,…

Machine Learning · Computer Science 2019-01-14 Adam Stooke, Pieter Abbeel

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Most Deep Reinforcement Learning (Deep RL) algorithms require a prohibitively large number of training samples for learning complex tasks. Many recent works on speeding up Deep RL have focused on distributed training and simulation. While…

Robotics · Computer Science 2018-10-25 Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, Dieter Fox

GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification

One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. While TRON has recently…

Machine Learning · Computer Science 2020-10-16 John T. Halloran, David M. Rocke

Optimizing CUDA Code By Kernel Fusion---Application on BLAS

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

Large Batch Simulation for Deep Reinforcement Learning

We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and…

Machine Learning · Computer Science 2021-03-15 Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

Optimizing Performance of Recurrent Neural Networks on GPUs

As recurrent neural networks become larger and deeper, training times for single networks are rising into weeks or even months. As such there is a significant incentive to improve the performance and scalability of these networks. While…

Machine Learning · Computer Science 2016-04-08 Jeremy Appleyard, Tomas Kocisky, Phil Blunsom

Importance of Explicit Vectorization for CPU and GPU Software Performance

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-18 Neil G. Dickson, Kamran Karimi, Firas Hamze

Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core

In atomistic spin dynamics simulations, the time cost of constructing the space- and time-displaced pair correlation function in real space increases quadratically as the number of spins $N$, leading to significant computational effort. The…

Computational Physics · Physics 2023-08-16 Hongwei Chen, Shiyang Chen, Joshua J. Turner, Adrian Feiguin

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta

Accelerating Visual-Policy Learning through Parallel Differentiable Simulation

In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation…

Machine Learning · Computer Science 2025-11-12 Haoxiang You, Yilang Liu, Ian Abraham

Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing

The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational…

Performance · Computer Science 2026-04-03 Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh

Analyzing Molecular Simulations Trajectories by Utilizing CUDA on GPU Architecture

With the advent of high-performance computing techniques, the data for analysis has grown significantly. Here, graphics processing unit (GPU) based program kernels are discussed to exploit parallelism in the analysis codes specific to…

Computational Physics · Physics 2018-11-07 Gourav Shrivastav, Manish Agarwal

NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning

One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is to decide if vectorization or interleaving is beneficial. Then, the compiler has to determine how many instructions to pack…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-07 Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Sophia Shao, Krste Asanovic, Ion Stoica

Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems

We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems. Our scheme allows the efficient use of compute accelerators such as GPUs and FPGAs for the training of…

Machine Learning · Computer Science 2017-11-08 Celestine Dünner, Thomas Parnell, Martin Jaggi

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

Reinforcement learning with verifiable rewards (RLVR) has recently unlocked strong reasoning capabilities in large language models (LLMs), triggering rapid exploration of new algorithms and data. However, RLVR training is notoriously…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-21 Yiqi Zhang, Fangzheng Jiao, Tian Tang, Boyu Tian, Hangyu Wang, Qiaoling Chen, Guoteng Wang, Zhen Jiang, Peng Sun, Ping Zhang, Xiaohe Hu, Ziming Liu, Menghao Zhang, Yanmin Jia, Yang You, Siyuan Feng

GCL-Sampler: Discovering Kernel Similarity for Sampled GPU Simulation via Graph Contrastive Learning

GPU architectural simulation is orders of magnitude slower than native execution, necessitating workload sampling for practical speedups. Existing methods rely on hand-crafted features with limited expressiveness, yielding either aggressive…

Performance · Computer Science 2026-03-03 Jiaqi Wang, Jingwei Sun, Jiyu Luo, Han Li, Guangzhong Sun

Operator Fusion in XLA: Analysis and Evaluation

Machine learning (ML) compilers are an active area of research because they offer the potential to automatically speed up tensor programs. Kernel fusion is often cited as an important optimization performed by ML compilers. However, there…

Machine Learning · Computer Science 2023-01-31 Daniel Snider, Ruofan Liang

Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units

The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in research fields of condensed matter physics and chemistry. The algorithm is difficult to parallelize on a…

Computational Physics · Physics 2011-05-30 Shixun Zhang, Shinichi Yamagiwa, Masahiko Okumura, Seiji Yunoki