Related papers: Hash Layers For Large Sparse Models

BASE Layers: Simplifying Training of Large, Sparse Models

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by…

Computation and Language · Computer Science 2021-04-01 Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer

Sparse is Enough in Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Machine Learning · Computer Science 2021-11-29 Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly…

Computation and Language · Computer Science 2024-11-07 Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training

Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate…

Machine Learning · Computer Science 2022-02-07 Lillian Zhou, Dhruv Guliani, Andreas Kabel, Giovanni Motta, Françoise Beaufays

Training Neural Networks with Fixed Sparse Masks

During typical gradient-based training of deep neural networks, all of the model's parameters are updated at each iteration. Recent work has shown that it is possible to update only a small subset of the model's parameters during training,…

Machine Learning · Computer Science 2021-11-19 Yi-Lin Sung, Varun Nair, Colin Raffel

Large-Scale Distributed Learning via Private On-Device Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) based frameworks have been used efficiently to select weight vectors in a dense hidden layer with high cosine similarity to an input, enabling dynamic pruning. While this type of scheme has been shown to…

Machine Learning · Computer Science 2023-06-06 Tahseen Rabbani, Marco Bornstein, Furong Huang

Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model

Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) have proven effective in scaling up Transformers model size for pretraining large language models. By only activating part of the FFN parameters…

Computation and Language · Computer Science 2023-10-25 Zeyu Leo Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, Xian Li

Sparser, Faster, Lighter Transformer Language Models

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the…

Machine Learning · Computer Science 2026-05-11 Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex…

Machine Learning · Computer Science 2024-06-26 Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen

Sparse Networks from Scratch: Faster Training without Losing Performance

We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels. We accomplish this by developing sparse…

Machine Learning · Computer Science 2019-08-27 Tim Dettmers, Luke Zettlemoyer

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Learning Multi-Layer Transform Models

Learned data models based on sparsity are widely used in signal processing and imaging applications. A variety of methods for learning synthesis dictionaries, sparsifying transforms, etc., have been proposed in recent years, often imposing…

Machine Learning · Computer Science 2018-10-22 Saiprasad Ravishankar, Brendt Wohlberg

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis

State-of-the-art LLMs often rely on scale with high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs,…

Computation and Language · Computer Science 2024-07-25 Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

Memory Layers at Scale

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated…

Computation and Language · Computer Science 2024-12-23 Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh

Hyperparameter Transfer with Mixture-of-Expert Layers

Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add…

Machine Learning · Computer Science 2026-05-22 Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders

Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to fine-tuning Large Language Models (LLMs), this approach has been…

Computation and Language · Computer Science 2025-02-25 Suneel Nadipalli

Can pruning make Large Language Models more efficient?

Transformer models have revolutionized natural language processing with their unparalleled ability to grasp complex contextual relationships. However, the vast number of parameters in these models has raised concerns regarding computational…

Machine Learning · Computer Science 2023-10-10 Sia Gholami, Marwan Omar

Sparse Random Networks for Communication-Efficient Federated Learning

One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient…

Machine Learning · Computer Science 2023-02-10 Berivan Isik, Francesco Pase, Deniz Gunduz, Tsachy Weissman, Michele Zorzi

Allocation of Parameters in Transformers

Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention…

Machine Learning · Computer Science 2025-10-07 Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya, Ondrej Bojar