Related papers: Stream-K Optimization and Exploration
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an…
General matrix multiplication (GEMM) operations are the fundamental building blocks of computational domains including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important,…
Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on the GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an…
There has been a recent explosion in the size of stored data, partially due to advances in storage technology, and partially due to the growing popularity of cloud-computing and the vast quantities of data generated. This motivates the need…
We design and implement two single-pass semi-streaming algorithms for the maximum weight $k$-disjoint matching ($k$-DM) problem. Given an integer $k$, the $k$-DM problem is to find $k$ pairwise edge-disjoint matchings such that the sum of…
We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots \Delta\}^d$.…
We introduce the poly-streaming model, a generalization of streaming models of computation in which $k$ processors process $k$ data streams containing a total of $N$ items. The algorithm is allowed $O\left(f(k)\cdot M_1\right)$ space, where…
General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or…
The amount of collected information on data repositories has vastly increased with the advent of the internet. It has become increasingly complex to deal with these massive data streams due to their sheer volume and the throughput of…
Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we…
We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words…
The Incremental K-means (IKM), an improved version of K-means (KM), was introduced to improve the clustering quality of KM significantly. However, the speed of IKM is slower than KM. My thesis proposes two algorithms to speed up IKM while…
The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…
The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…
Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. It is well known that SpGEMM is a memory-bound operation, and its peak performance is expected to…
Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to…
In the semi-streaming model, an algorithm receives a stream of edges of a graph in arbitrary order and uses a memory of size $O(n \mbox{ polylog } n)$, where $n$ is the number of vertices of a graph. In this work, we present semi-streaming…
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by…
In this paper we consider the problem of finding a maximum weight set subject to a $k$-extendible constraint in the data stream model. The only non-trivial algorithm known for this problem to date---to the best of our knowledge---is a…
Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient…