Related papers: Memory-efficient array redistribution through port…

Fast parallel multidimensional FFT using advanced MPI

We present a new method for performing global redistributions of multidimensional arrays essential to parallel fast Fourier (or similar) transforms. Traditional methods use standard all-to-all collective communication of contiguous memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-26 Lisandro Dalcin, Mikael Mortensen, David E Keyes

Simultaneous Inference for Massive Data: Distributed Bootstrap

In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without over-resampling,…

Machine Learning · Statistics 2020-02-21 Yang Yu, Shih-Kang Chao, Guang Cheng

Sparse Matrix to Matrix Multiplication: A Representation and Architecture for Acceleration (long version)

Accelerators for sparse matrix multiplication are important components in emerging systems. In this paper, we study the main challenges of accelerating Sparse Matrix Multiplication (SpMM). For situations in which data is not stored in the…

Hardware Architecture · Computer Science 2019-06-04 Pareesa Ameneh Golnari, Sharad Malik

Efficient Multidimensional Data Redistribution for Resizable Parallel Computations

Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-06-15 Rajesh Sudarsan, Calvin J. Ribbens

Efficient Distributed-Memory Parallel Matrix-Vector Multiplication with Wide or Tall Unstructured Sparse Matrices

This paper presents an efficient technique for matrix-vector and vector-transpose-matrix multiplication in distributed-memory parallel computing environments, where the matrices are unstructured, sparse, and have a substantially larger…

Mathematical Software · Computer Science 2018-12-04 Jonathan Eckstein, Gyorgy Matyasfalvi

Easy Acceleration with Distributed Arrays

High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Jeremy Kepner, Chansup Byun, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Vijay Gadepally, Ryan Haney, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Peter Michaleas

High Performance Parallel Sort for Shared and Distributed Memory MIMD

We present four high performance hybrid sorting methods developed for various parallel platforms: shared memory multiprocessors, distributed multiprocessors, and clusters taking advantage of the existence of both shared and distributed memory.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-04 Thoria Alghamdi, Gita Alaghband

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous clusters, to fully exploit available resources for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-12 Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin

Distributed Caching for Complex Querying of Raw Arrays

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate…

Databases · Computer Science 2018-03-19 Weijie Zhao, Florin Rusu, Bin Dong, Kesheng Wu, Anna Y. Q. Ho, Peter Nugent

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

When using stochastic gradient descent to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple machines if needed, and then perform several…

Machine Learning · Statistics 2017-10-02 Qi Meng, Wei Chen, Yue Wang, Zhi-Ming Ma, Tie-Yan Liu

Learned spatial data partitioning

Due to the significant increase in the size of spatial data, it is essential to use distributed parallel processing systems to efficiently analyze spatial data. In this paper, we first study learned spatial data partitioning, which…

Databases · Computer Science 2023-06-21 Keizo Hori, Yuya Sasaki, Daichi Amagata, Yuki Murosaki, Makoto Onizuka

Efficient Distributed Learning with Sparsity

We propose a novel, efficient approach for distributed sparse learning in high-dimensions, where observations are randomly partitioned across machines. Computationally, at each round our method only requires the master machine to solve a…

Machine Learning · Statistics 2016-05-26 Jialei Wang, Mladen Kolar, Nathan Srebro, Tong Zhang

SplitBrain: Hybrid Data and Model Parallel Deep Learning

The recent success of deep learning applications has coincided with those widely available powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as…

Machine Learning · Computer Science 2022-01-03 Farley Lai, Asim Kadav, Erik Kruus

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-14 Aditya Devarakonda, Ramakrishnan Kannan

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the challenges of large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-13 Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You

Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning

Distributed computing excels at processing large scale data, but the communication cost for synchronizing the shared parameters may slow down the overall performance. Fortunately, the interactions between parameter and data in many problems…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-19 Mu Li, Dave G. Andersen, Alexander J. Smola

Recursive Multi-Section on the Fly: Shared-Memory Streaming Algorithms for Hierarchical Graph Partitioning and Process Mapping

Partitioning a graph into balanced blocks such that few edges run between blocks is a key problem for large-scale distributed processing. A current trend for partitioning huge graphs is streaming algorithms, which use low computational…

Data Structures and Algorithms · Computer Science 2022-02-02 Marcelo Fonseca Faraj, Christian Schulz

An Asynchronous Distributed Framework for Large-scale Learning Based on Parameter Exchanges

In many distributed learning problems, the heterogeneous loading of computing machines may harm the overall performance of synchronous strategies. In this paper, we propose an effective asynchronous distributed framework for the…

Machine Learning · Statistics 2017-05-23 Bikash Joshi, Franck Iutzeler, Massih-Reza Amini

Transitive Array: An Efficient GEMM Accelerator with Result Reuse

Deep Neural Networks (DNNs) and Large Language Models (LLMs) have revolutionized artificial intelligence, yet their deployment faces significant memory and computational challenges, especially in resource-constrained environments.…

Hardware Architecture · Computer Science 2025-04-24 Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, Yiran Chen

Efficient Distributed Transposition Of Large-Scale Multigraphs And High-Cardinality Sparse Matrices

Graph-based representations underlie a wide range of scientific problems. Graph connectivity is typically represented as a sparse matrix in the Compressed Sparse Row format. Large-scale graphs rely on distributed storage, allocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-14 Bruno Magalhaes, Felix Schürmann