Related papers: Modular Duality in Deep Learning
Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number…
Traditional deep network training methods optimize a monolithic objective function jointly for all the components. This can lead to various inefficiencies in terms of potential parallelization. Local learning is an approach to…
Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that the strong duality holds (which means zero duality gap) for regularized finite-width two-layer ReLU networks and consequently…
Modularity is a general principle present in many fields. It offers attractive advantages, including, among others, ease of conceptualization, interpretability, scalability, module combinability, and module reusability. The deep learning…
Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional…
The learned weights of a neural network are often considered devoid of scrutable internal structure. To discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate…
The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization…
Training a Neural Network (NN) with lots of parameters or intricate architectures creates undesired phenomena that complicate the optimization process. To address this issue we propose a first modular approach to NN design, wherein the NN…
Transfer learning has recently become the dominant paradigm of machine learning. Pre-trained models fine-tuned for downstream tasks achieve better performance with fewer labelled examples. Nonetheless, it remains unclear how to develop…
For most deep learning algorithms training is notoriously time consuming. Since most of the computation in training neural networks is typically spent on floating point multiplications, we investigate an approach to training that eliminates…
Deep learning has been widely used for supervised learning and classification/regression problems. Recently, a novel area of research has applied this paradigm to unsupervised tasks; indeed, a gradient-based approach extracts, efficiently…
We investigate deep morphological neural networks (DMNNs). We demonstrate that despite their inherent non-linearity, "linear" activations are essential for DMNNs. To preserve their inherent sparsity, we propose architectures that constraint…
Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the…
Modularity has been widely studied as a mechanism to improve the capabilities of neural networks through various techniques such as hand-crafted modular architectures and automatic approaches. While these methods have sometimes shown…
To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has…
Generalization of deep neural networks remains one of the main open problems in machine learning. Previous theoretical works focused on deriving tight bounds of model complexity, while empirical works revealed that neural networks exhibit…
We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales, where slow parts are updated less frequently. By choosing appropriate…
We present a novel regularization approach to train neural networks that enjoys better generalization and test error than standard stochastic gradient descent. Our approach is based on the principles of cross-validation, where a validation…
With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus…
In low-latency or mobile applications, lower computation complexity, lower memory footprint and better energy efficiency are desired. Many prior works address this need by removing redundant parameters. Parameter quantization replaces…