Related papers: Stream-K Optimization and Exploration

Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an…

Data Structures and Algorithms · Computer Science 2023-01-11 Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens

Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters

General matrix multiplication (GEMM) operations are the fundamental building blocks of computational domains including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-26 Harisankar Sadasivan, Muhammed Emin Ozturk, Muhammad Osama, Chris Millette, Astha Rai, Maksim Podkorytov, John Afaganis, Carlus Huang, Jing Zhang, Jun Liu

GPU Load Balancing

Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-20 Muhammad Osama

Streaming Balanced Graph Partitioning for Random Graphs

There has been a recent explosion in the size of stored data, partially due to advances in storage technology, and partially due to the growing popularity of cloud computing and the vast quantities of data generated. This motivates the need…

Data Structures and Algorithms · Computer Science 2012-12-06 Isabelle Stanton

Semi-Streaming Algorithms for Weighted $k$-Disjoint Matchings

We design and implement two single-pass semi-streaming algorithms for the maximum weight $k$-disjoint matching ($k$-DM) problem. Given an integer $k$, the $k$-DM problem is to find $k$ pairwise edge-disjoint matchings such that the sum of…

Data Structures and Algorithms · Computer Science 2024-07-09 S M Ferdous, Bhargav Samineni, Alex Pothen, Mahantesh Halappanavar, Bala Krishnamoorthy

Clustering High Dimensional Dynamic Data Streams

We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e., streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots, \Delta\}^d$.…

Data Structures and Algorithms · Computer Science 2017-06-14 Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, Lin F. Yang

Weighted Matching in a Poly-Streaming Model

We introduce the poly-streaming model, a generalization of streaming models of computation in which $k$ processors process $k$ data streams containing a total of $N$ items. The algorithm is allowed $O\left(f(k)\cdot M_1\right)$ space, where…

Data Structures and Algorithms · Computer Science 2025-07-21 Ahammed Ullah, S. M. Ferdous, Alex Pothen

Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs

General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or…

Hardware Architecture · Computer Science 2025-03-11 Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin

kMatrix: A Space Efficient Streaming Graph Summarization Technique

The amount of collected information on data repositories has vastly increased with the advent of the internet. It has become increasingly complex to deal with these massive data streams due to their sheer volume and the throughput of…

Information Retrieval · Computer Science 2021-05-13 Oshan Mudannayake, Nalin Ranasinghe

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-10 Mehmet Deveci, Christian Trott, Sivasankaran Rajamanickam

Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings

We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words…

Data Structures and Algorithms · Computer Science 2025-04-24 Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou

Improving The Performance Of The K-means Algorithm

The Incremental K-means (IKM), an improved version of K-means (KM), was introduced to improve the clustering quality of KM significantly. However, the speed of IKM is slower than KM. My thesis proposes two algorithms to speed up IKM while…

Machine Learning · Computer Science 2020-05-12 Tien-Dung Nguyen

Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…

Hardware Architecture · Computer Science 2024-03-13 Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí

Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems

The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…

Mathematical Software · Computer Science 2015-05-30 Davide Anastasia, Yiannis Andreopoulos

Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. It is well known that SpGEMM is a memory-bound operation, and its peak performance is expected to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-27 Zhixiang Gu, Jose Moreira, David Edelsohn, Ariful Azad

Improved Streaming Algorithm for Fair $k$-Center Clustering

Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to…

Data Structures and Algorithms · Computer Science 2026-01-19 Longkun Guo, Zeyu Lin, Chaoqi Jia, Chao Chen

Maximum Matching in Semi-Streaming with Few Passes

In the semi-streaming model, an algorithm receives a stream of edges of a graph in arbitrary order and uses a memory of size $O(n \mbox{ polylog } n)$, where $n$ is the number of vertices of a graph. In this work, we present semi-streaming…

Data Structures and Algorithms · Computer Science 2014-04-11 Christian Konrad, Frédéric Magniez, Claire Mathieu

Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by…

Hardware Architecture · Computer Science 2026-03-20 Qunyou Liu, Marina Zapater, David Atienza

Almost Optimal Semi-streaming Maximization for k-Extendible Systems

In this paper we consider the problem of finding a maximum weight set subject to a $k$-extendible constraint in the data stream model. The only non-trivial algorithm known for this problem to date, to the best of our knowledge, is a…

Data Structures and Algorithms · Computer Science 2019-06-12 Moran Feldman, Ran Haba

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-03-19 Aydin Buluc, John Gilbert