Related papers: Parallelizable Stack Long Short-Term Memory

Single stream parallelization of generalized LSTM-like RNNs on a GPU

Recurrent neural networks (RNNs) have shown outstanding performance on processing sequence data. However, they suffer from long training time, which demands parallel implementations of the training procedure. Parallelization of the training…

Neural and Evolutionary Computing · Computer Science 2015-11-25 Kyuyeon Hwang , Wonyong Sung

Parallel Long Short-Term Memory for Multi-stream Classification

Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome with respect to a sequence of inputs. Recurrent hidden cells…

Machine Learning · Computer Science 2017-02-15 Mohamed Bouaziz , Mohamed Morchid , Richard Dufour , Georges Linarès , Renato De Mori

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li , Yanli Zhao , Rohan Varma , Omkar Salpekar , Pieter Noordhuis , Teng Li , Adam Paszke , Jeff Smith , Brian Vaughan , Pritam Damania , Soumith Chintala

Efficient allocation of image recognition and LLM tasks on multi-GPU system

This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning model in image recognition, various parallelization…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Marcin Lawenda , Krzesimir Samborski , Kyrylo Khloponin , Łukasz Szustak

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-12 Ruben Laso , Diego Krupitza , Sascha Hunold

Sequence Parallelism: Long Sequence Training from System Perspective

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li , Fuzhao Xue , Chaitanya Baranwal , Yongbin Li , Yang You

DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training

Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to…

Machine Learning · Computer Science 2023-07-18 Hongkuan Zhou , Da Zheng , Xiang Song , George Karypis , Viktor Prasanna

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Anand Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , Matei Zaharia

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. However, the hardware requirement of PTM training is prohibitively high, making it a game for a small proportion of people. Therefore, we proposed…

Machine Learning · Computer Science 2022-11-11 Jiarui Fang , Zilin Zhu , Shenggui Li , Hui Su , Yang Yu , Jie Zhou , Yang You

Data-parallel distributed training of very large models beyond GPU capacity

GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-30 Samuel Matzek , Max Grossman , Minsik Cho , Anar Yusifov , Bryant Nelson , Amit Juneja

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go , Joongun Park , Spandan More , Hanjiang Wu , Irene Wang , Aaron Jezghani , Tushar Krishna , Divya Mahajan

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho , Felipe O. Lins e Silva , Daniel Aloise , Samuel , Xavier-de-Souza

PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks

In this paper, we present PARTIME, a software library written in Python and based on PyTorch, designed specifically to speed up neural networks whenever data is continuously streamed over time, for both learning and inference. Existing…

Machine Learning · Computer Science 2022-12-05 Enrico Meloni , Lapo Faggi , Simone Marullo , Alessandro Betti , Matteo Tiezzi , Marco Gori , Stefano Melacci

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu , Aurick Qiao , Michael Wyatt , Jeff Rasley , Yuxiong He , Samyam Rajbhandari

LiteLSTM Architecture for Deep Recurrent Neural Networks

Long short-term memory (LSTM) is a robust recurrent neural network architecture for learning spatiotemporal sequential data. However, it requires significant computational power for learning and implementing from both software and hardware…

Machine Learning · Computer Science 2022-10-26 Nelly Elsayed , Zag ElSayed , Anthony S. Maida

Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections

Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-27 Marcel Wagenländer , Guo Li , Bo Zhao , Luo Mai , Peter Pietzuch

Parallel Delta-Stepping Algorithm for Shared Memory Architectures

We present a shared memory implementation of a parallel algorithm, called delta-stepping, for solving the single source shortest path problem for directed and undirected graphs. In order to reduce synchronization costs we make some…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-21 M. Kranjčević , D. Palossi , S. Pintarelli

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-21 Bita Hasheminezhad , Shahrzad Shirzad , Nanmiao Wu , Patrick Diehl , Hannes Schulz , Hartmut Kaiser

TurboTransformers: An Efficient GPU Serving System For Transformer Models

The transformer is the most critical algorithm innovation of the Nature Language Processing (NLP) field in recent years. Unlike the Recurrent Neural Network (RNN) models, Transformers can process on dimensions of sequence lengths in…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-23 Jiarui Fang , Yang Yu , Chengduo Zhao , Jie Zhou

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with…

Machine Learning · Computer Science 2025-02-19 Ahmed F. AbouElhamayed , Jordan Dotzel , Yash Akhauri , Chi-Chih Chang , Sameh Gobriel , J. Pablo Muñoz , Vui Seng Chua , Nilesh Jain , Mohamed S. Abdelfattah