Related papers: Parallelizable Stack Long Short-Term Memory

200 papers

Recurrent neural networks (RNNs) have shown outstanding performance on processing sequence data. However, they suffer from long training time, which demands parallel implementations of the training procedure. Parallelization of the training…

Neural and Evolutionary Computing · Computer Science 2015-11-25 Kyuyeon Hwang , Wonyong Sung

Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome with respect to a sequence of inputs. Recurrent hidden cells…

Machine Learning · Computer Science 2017-02-15 Mohamed Bouaziz , Mohamed Morchid , Richard Dufour , Georges Linarès , Renato De Mori

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li , Yanli Zhao , Rohan Varma , Omkar Salpekar , Pieter Noordhuis , Teng Li , Adam Paszke , Jeff Smith , Brian Vaughan , Pritam Damania , Soumith Chintala

This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning models in image recognition, various parallelization…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Marcin Lawenda , Krzesimir Samborski , Kyrylo Khloponin , Łukasz Szustak

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-12 Ruben Laso , Diego Krupitza , Sascha Hunold

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li , Fuzhao Xue , Chaitanya Baranwal , Yongbin Li , Yang You

Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to…

Machine Learning · Computer Science 2023-07-18 Hongkuan Zhou , Da Zheng , Xiang Song , George Karypis , Viktor Prasanna

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. However, the hardware requirement of PTM training is prohibitively high, making it a game for a small proportion of people. Therefore, we proposed…

Machine Learning · Computer Science 2022-11-11 Jiarui Fang , Zilin Zhu , Shenggui Li , Hui Su , Yang Yu , Jie Zhou , Yang You

GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-30 Samuel Matzek , Max Grossman , Minsik Cho , Anar Yusifov , Bryant Nelson , Amit Juneja

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go , Joongun Park , Spandan More , Hanjiang Wu , Irene Wang , Aaron Jezghani , Tushar Krishna , Divya Mahajan

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho , Felipe O. Lins e Silva , Daniel Aloise , Samuel Xavier-de-Souza

In this paper, we present PARTIME, a software library written in Python and based on PyTorch, designed specifically to speed up neural networks whenever data is continuously streamed over time, for both learning and inference. Existing…

Machine Learning · Computer Science 2022-12-05 Enrico Meloni , Lapo Faggi , Simone Marullo , Alessandro Betti , Matteo Tiezzi , Marco Gori , Stefano Melacci

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu , Aurick Qiao , Michael Wyatt , Jeff Rasley , Yuxiong He , Samyam Rajbhandari

Long short-term memory (LSTM) is a robust recurrent neural network architecture for learning spatiotemporal sequential data. However, it requires significant computational power for learning and implementing from both software and hardware…

Machine Learning · Computer Science 2022-10-26 Nelly Elsayed , Zag ElSayed , Anthony S. Maida

Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-27 Marcel Wagenländer , Guo Li , Bo Zhao , Luo Mai , Peter Pietzuch

We present a shared memory implementation of a parallel algorithm, called delta-stepping, for solving the single source shortest path problem for directed and undirected graphs. In order to reduce synchronization costs we make some…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-21 M. Kranjčević , D. Palossi , S. Pintarelli

Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-21 Bita Hasheminezhad , Shahrzad Shirzad , Nanmiao Wu , Patrick Diehl , Hannes Schulz , Hartmut Kaiser

The transformer is the most critical algorithm innovation of the Natural Language Processing (NLP) field in recent years. Unlike the Recurrent Neural Network (RNN) models, Transformers can process on dimensions of sequence lengths in…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-23 Jiarui Fang , Yang Yu , Chengduo Zhao , Jie Zhou

Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with…
