Related papers: RASA: Efficient Register-Aware Systolic Array Matr…
The computation and memory-intensive nature of DNNs limits their use in many mobile and embedded contexts. Application-specific integrated circuit (ASIC) hardware accelerators employ matrix multiplication units (such as the systolic arrays)…
Leveraging high degrees of unstructured sparsity is a promising approach to enhance the efficiency of deep neural network DNN accelerators - particularly important for emerging Edge-AI applications. We introduce VUSA, a systolic-array…
The memory capacity in edge devices is often limited due to constraints on cost, size, and power. Consequently, memory competition leads to inevitable page swapping in memory-constrained mixed-criticality edge devices, causing slow storage…
Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails…
Systolic arrays have been widely used for accelerating HPC and deep learning applications. There is a plethora of previous works on the performance tuning of systolic arrays, but usually based on a number of oversimplified assumptions…
The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware…
The paper discusses how Systolic Arrays can improve matrix multiplication for deep neural networks (DNNs). With AI models like OpenAI's GPT now containing trillions of parameters, the need for efficient matrix multiplication is more…
In High Performance Computing (HPC) infrastructures, the control of resources by batch systems can lead to prolonged queue waiting times and adverse effects on the overall execution times of applications, particularly in data-intensive and…
While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This leaves the question of if it…
RDMA is increasingly adopted by cloud computing platforms to provide low CPU overhead, low latency, high throughput network services. On the other hand, however, it is still challenging for developers to realize fast deployment of…
Sparse sensor arrays offer a cost effective alternative to uniform arrays. By utilizing the co-array, a sparse array can match the performance of a filled array, despite having significantly fewer sensors. However, even sparse arrays can…
The next generation of radar systems will include advanced digital front-end technology in the apertures allowing for spatially subdividing radar tasks over the array, the so-called split-aperture phased array (SAPA) concept. The goal of…
Redundancy elimination is a key optimization direction, and loop nests are the main optimization target in modern compilers. Previous work on redundancy elimination of array computations in loop nests lacks universality. These approaches…
R is a popular language and programming environment for data scientists. It is increasingly co-packaged with both relational and Hadoop-based data platforms and can often be the most dominant computational component in data analytics…
To enhance the resource scheduling performance of phased array radar, we propose a dynamic adaptive resource scheduling algorithm based on synthesis priorities and pulse interleaving. This approach addresses the challenges of low…
In this paper, the acceleration of algorithms using a design of a field programmable gate array (FPGA) as a prototype of a static dataflow architecture is discussed. The static dataflow architecture using operators interconnected by…
Increasing demands for computing power also propel the need for energy-efficient SoC accelerator architectures. One class of such accelerators are so-called processor arrays, which typically integrate a two-dimensional mesh of…
Flexible front-end technology will become available in future multifunction radar systems to improve adaptability to the operational theatre. A potential concept to utilize this flexibility is to subdivide radar tasks spatially over the…
Simulated annealing (SA) is a well-known algorithm for solving combinatorial optimization problems. However, the computation time of SA increases rapidly, as the size of the problem grows. Recently, a stochastic simulated annealing (SSA)…
Ising Machine is a promising computing approach for solving combinatorial optimization problems. It is naturally suited for energy-saving and compact in-memory computing implementations with emerging memories. A na\"ive in-memory computing…