Related papers: Communication-Avoiding Parallel Algorithms for Sol…

A Reexamination of the Communication Bandwidth Cost Analysis of A Parallel Recursive Algorithm for Solving Triangular Systems of Linear Equations

This paper presents a reexamination of the research paper titled "Communication-Avoiding Parallel Algorithms for TRSM" by Wicky et al. We focus on the communication bandwidth cost analysis presented in the original work and identify…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-02 Yuan Tang

Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-03 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Parallel dichotomy algorithm for solving tridiagonal SLAEs

A parallel algorithm for solving a series of matrix equations with a constant tridiagonal matrix and different right-hand sides is proposed and studied. The process of solving the problem is represented in two steps. The first preliminary…

Numerical Analysis · Mathematics 2010-12-07 Andrew Terekhov

Parallel Algorithms for Tensor Train Arithmetic

We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…

Numerical Analysis · Mathematics 2021-09-08 Hussam Al Daas, Grey Ballard, Peter Benner

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference.…

Data Structures and Algorithms · Computer Science 2020-06-26 Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, Yihan Sun

Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK), ii) adding the result of the multiplication of a matrix with…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Verite

Parallel solver for shifted systems in a hybrid CPU-GPU framework

This paper proposes a combination of a hybrid CPU-GPU and a pure GPU software implementation of a direct algorithm for solving shifted linear systems $(A - \sigma I)X = B$ with a large number of complex shifts $\sigma$ and multiple…

Mathematical Software · Computer Science 2017-08-24 Nela Bosner, Zvonimir Bujanović, Zlatko Drmač

Communication-avoiding Cholesky-QR2 for rectangular matrices

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-18 Edward Hutter, Edgar Solomonik

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes…

Data Structures and Algorithms · Computer Science 2012-02-16 Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Communication-optimal Parallel and Sequential Cholesky Decomposition

Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case).…

Numerical Analysis · Computer Science 2011-02-02 Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Incremental tensor regularized least squares with multiple right-hand sides

Solving linear discrete ill-posed problems for third order tensor equations based on a tensor t-product has attracted much attention. But when the data tensor is produced continuously, current algorithms are not time-saving. Here, we…

Numerical Analysis · Mathematics 2021-11-30 Zhengbang Cao, Pengpeng Xie

Implementation of the Trigonometric LMS Algorithm using Original Cordic Rotation

The LMS algorithm is one of the most successful adaptive filtering algorithms. It uses the instantaneous value of the square of the error signal as an estimate of the mean-square error (MSE). The LMS algorithm changes (adapts) the filter…

Other Computer Science · Computer Science 2011-04-22 Nasrin Akhter, Kaniz Fatema, Lilatul Ferdouse, Faria Khandaker

A unified consensus-based parallel ADMM algorithm for high-dimensional regression with combined regularizations

The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical…

Machine Learning · Statistics 2023-11-22 Xiaofei Wu, Zhimin Zhang, Zhenyu Cui

Parallel and Communication Avoiding Least Angle Regression

We are interested in parallelizing the Least Angle Regression (LARS) algorithm for fitting linear regression models to high-dimensional data. We consider two parallel and communication avoiding versions of the basic LARS algorithm. The two…

Machine Learning · Computer Science 2020-09-15 S. Das, J. Demmel, K. Fountoulakis, L. Grigori, M. W. Mahoney, S. Yang

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-04-19 Edgar Solomonik, Grey Ballard, James Demmel, Torsten Hoefler

Enhancing the scalability and load balancing of the parallel selected inversion algorithm via tree-based asynchronous communication

We develop a method for improving the parallel scalability of the recently developed parallel selected inversion algorithm [Jacquelin, Lin and Yang 2014], named PSelInv, on massively parallel distributed memory machines. In the PSelInv…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-04-21 Mathias Jacquelin, Lin Lin, Nathan Wichmann, Chao Yang

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Bin Xiao, Lei Su

A Technique for Improving the Computation of Functions of Triangular Matrices

We propose a simple technique that, if combined with algorithms for computing functions of triangular matrices, can make them more efficient. Basically, such a technique consists in a specific scaling similarity transformation that reduces…

Numerical Analysis · Mathematics 2021-11-18 João R. Cardoso, Amir Sadeghi

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

Efficient Parallel Scheduling for Sparse Triangular Solvers

We develop and analyze new scheduling algorithms for solving sparse triangular linear systems (SpTRSV) in parallel. Our approach produces highly efficient synchronous schedules for the forward- and backward-substitution algorithm. Compared…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-06 Toni Böhnlein, Pál András Papp, Raphael S. Steiner, Christos K. Matzoros, A. N. Yzelman