Related papers: Low-memory GEMM-based convolution algorithms for d…
Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS…
Quantization has emerged to be an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory…
Convolution is the most time-consuming operation in deep neural network operations, so its performance is critical to the overall performance of the neural network. The commonly used methods for convolution on GPU include the general matrix…
Convolution is a critical component in modern deep neural networks, thus several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods…
Sliding Window Sum algorithms have been successfully used for training and inference of Deep Neural Networks. We have shown before how both pooling and convolution 1-D primitives could be expressed as sliding sums and evaluated by the…
Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve…
An important linear algebra routine, GEneral Matrix Multiplication (GEMM), is a fundamental operator in deep learning. Compilers need to translate these routines into low-level code optimized for specific hardware. Compiler-level…
Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each…
With recent trend of wearable devices and Internet of Things (IoTs), it becomes attractive to develop hardware-based deep convolutional neural networks (DCNNs) for embedded applications, which require low power/energy consumptions and small…
Convolution is the most expensive operation among neural network operations, thus its performance is critical to the overall performance of neural networks. Commonly used convolution approaches, including general matrix multiplication…
We introduce an incremental processing scheme for convolutional neural network (CNN) inference, targeted at embedded applications with limited memory budgets. Instead of processing layers one by one, individual input pixels are propagated…
DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of…
Edge computing must be capable of executing computationally intensive algorithms, such as Deep Neural Networks (DNNs) while operating within a constrained computational resource budget. Such computations involve Matrix Vector…
GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use…
Deep neural networks (DNN) have shown superior performance in a variety of tasks. As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices. Though…
The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the…
The inherent diversity of computation types within the deep neural network (DNN) models often requires a variety of specialized units in hardware processors, which limits computational efficiency, increasing both inference latency and power…
Winograd's minimal filtering algorithm has been widely used in Convolutional Neural Networks (CNNs) to reduce the number of multiplications for faster processing. However, it is only effective on convolutions with kernel size as 3x3 and…
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large…
Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application…