Related papers: Laconic Deep Learning Computing
We explore techniques to significantly improve the compute efficiency and performance of Deep Convolution Networks without impacting their accuracy. To improve the compute efficiency, we focus on achieving high accuracy with extremely…
Spatial-wise dynamic convolution has become a promising approach to improving the inference efficiency of deep networks. By allocating more computation to the most informative pixels, such an adaptive inference paradigm reduces the spatial…
Sparse neural networks can greatly facilitate the deployment of neural networks on resource-constrained platforms as they offer compact model sizes while retaining inference accuracy. Because of the sparsity in parameter matrices, sparse…
Decomposition is a proven way to shrink deep networks without changing input-output dimensionality or interface semantics. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and…
Load balancers are pervasively used inside today's clouds to scalably distribute network requests across data center servers. Given the extensive use of load balancers and their associated operating costs, several efforts have focused on…
While dense retrieval models have become the standard for state-of-the-art information retrieval, their deployment is often constrained by high memory requirements and reliance on GPU accelerators for vector similarity search. Learned…
Optical and hybrid convolutional neural networks (CNNs) recently have become of increasing interest to achieve low-latency, low-power image classification and computer vision tasks. However, implementing optical nonlinearity is challenging,…
In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model…
Porting state of the art deep learning algorithms to resource constrained compute platforms (e.g. VR, AR, wearables) is extremely challenging. We propose a fast, compact, and accurate model for convolutional neural networks that enables…
Recent trends show recognition accuracy increasing even more profoundly. Inference process of Deep Convolutional Neural Networks (DCNN) has a large number of parameters, requires a large amount of computation, and can be very slow. The…
As state of the art neural networks (NNs) continue to grow in size, their resource-efficient implementation becomes ever more important. In this paper, we introduce a compression scheme that reduces the number of computations required for…
Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for superior accuracy. However, computing deep CNNs on traditional CPUs and…
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on…
The computation and memory costs of large language models kept increasing over last decade, which reached over the scale of 1T parameters. To address the challenges from the large scale models, model compression techniques such as low-rank…
Transformer-based models are becoming more and more intelligent and are revolutionizing a wide range of human tasks. To support their deployment, AI labs offer inference services that consume hundreds of GWh of energy annually and charge…
We propose a Digital Neuron, a hardware inference accelerator for convolutional deep neural networks with integer inputs and integer weights for embedded systems. The main idea to reduce circuit area and power consumption is manipulating…
Deep convolution Neural Network (DCNN) has been widely used in computer vision tasks. However, for edge devices even inference has too large computational complexity and data access amount. The inference latency of state-of-the-art models…
Inference efficiency is the predominant consideration in designing deep learning accelerators. Previous work mainly focuses on skipping zero values to deal with remarkable ineffectual computation, while zero bits in non-zero values, as…
Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited…
Recent success in deep neural networks has generated strong interest in hardware accelerators to improve speed and energy consumption. This paper presents a new type of photonic accelerator based on coherent detection that is scalable to…