Related papers: MicroNet for Efficient Language Modeling

Efficient softmax approximation for GPUs

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the…

Computation and Language · Computer Science 2017-06-20 Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

Adaptive Input Representations for Neural Language Modeling

We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and…

Computation and Language · Computer Science 2019-02-26 Alexei Baevski, Michael Auli

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is…

Computation and Language · Computer Science 2018-03-06 Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen

Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures

Transformer-based language models have recently been at the forefront of active research in text generation. However, these models' advances come at the price of prohibitive training costs, with parameter counts in the billions and compute…

Computation and Language · Computer Science 2025-02-04 Gabriel Lindenmaier, Sean Papay, Sebastian Padó

Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model

For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural…

Neural and Evolutionary Computing · Computer Science 2026-01-05 Kaiwen Tang, Zhanglu Yan, Weng-Fai Wong

Fast FullSubNet: Accelerate Full-band and Sub-band Fusion Model for Single-channel Speech Enhancement

FullSubNet is our recently proposed real-time single-channel speech enhancement network that achieves outstanding performance on the Deep Noise Suppression (DNS) Challenge dataset. A number of variants of FullSubNet have been proposed, but…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-08 Xiang Hao, Xiaofei Li

EduBERT: Pretrained Deep Language Models for Learning Analytics

The use of large pretrained neural networks to create contextualized word embeddings has drastically improved performance on several natural language processing (NLP) tasks. These computationally expensive models have begun to be applied to…

Computers and Society · Computer Science 2019-12-03 Benjamin Clavié, Kobi Gal

SIPA: A Simple Framework for Efficient Networks

With the success of deep learning in various fields and the advent of numerous Internet of Things (IoT) devices, it is essential to lighten models suitable for low-power devices. In keeping with this trend, MicroNet Challenge, which is the…

Machine Learning · Computer Science 2021-03-04 Gihun Lee, Sangmin Bae, Jaehoon Oh, Se-Young Yun

MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend…

Computation and Language · Computer Science 2026-05-27 Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma

Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks

Neural language models have been widely used in various NLP tasks, including machine translation, next word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow…

Machine Learning · Computer Science 2018-10-31 Patrick H. Chen, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh

Compressing Neural Language Models by Sparse Word Representations

Neural networks are among the state-of-the-art techniques for language modeling. Existing neural language models typically map discrete words to distributed, dense vector representations. After information processing of the preceding…

Computation and Language · Computer Science 2016-10-14 Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, Zhi Jin

Lightweight Adaptation of Neural Language Models via Subspace Embedding

Traditional neural word embeddings are usually dependent on a richer diversity of vocabulary. However, the language models recline to cover major vocabularies via the word embedding parameters, in particular, for multilingual language…

Computation and Language · Computer Science 2023-08-21 Amit Kumar Jaiswal, Haiming Liu

HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks

The workflow of pretraining and fine-tuning has emerged as a popular paradigm for solving various NLP and V&L (Vision-and-Language) downstream tasks. With the capacity of pretrained models growing rapidly, how to perform parameter-efficient…

Computation and Language · Computer Science 2022-03-09 Zhengkun Zhang, Wenya Guo, Xiaojun Meng, Yasheng Wang, Yadao Wang, Xin Jiang, Qun Liu, Zhenglu Yang

Fast and Simple Mixture of Softmaxes with BPE and Hybrid-LightRNN for Language Generation

Mixture of Softmaxes (MoS) has been shown to be effective at addressing the expressiveness limitation of Softmax-based models. Despite the known advantage, MoS is practically sealed by its large consumption of memory and computational time…

Computation and Language · Computer Science 2019-06-27 Xiang Kong, Qizhe Xie, Zihang Dai, Eduard Hovy

ComplexityNet: Increasing LLM Inference Efficiency by Learning Task Complexity

We present ComplexityNet, a streamlined language model designed for assessing task complexity. This model predicts the likelihood of accurate output by various language models, each with different capabilities. Our initial application of…

Computation and Language · Computer Science 2024-10-16 Henry Bae, Aghyad Deeb, Alex Fleury, Kehang Zhu

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more…

Computation and Language · Computer Science 2018-06-20 Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, Cho-jui Hsieh

Language Model Networks: Supervision-Efficient Learning through Dense Communication

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time reasoning to multi-model collaboration. We study language model networks, where pre-trained language…

Artificial Intelligence · Computer Science 2026-05-14 Shiguang Wu, Yaqing Wang, Quanming Yao

NeuroPrune: A Neuro-inspired Topological Sparse Training Algorithm for Large Language Models

Transformer-based Language Models have become ubiquitous in Natural Language Processing (NLP) due to their impressive performance on various tasks. However, expensive training as well as inference remains a significant impediment to their…

Machine Learning · Computer Science 2024-06-06 Amit Dhurandhar, Tejaswini Pedapati, Ronny Luss, Soham Dan, Aurelie Lozano, Payel Das, Georgios Kollias

Real-Time Execution of Large-scale Language Models on Mobile

Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed on hardware platforms have impeded the…

Computation and Language · Computer Science 2020-10-23 Wei Niu, Zhenglun Kong, Geng Yuan, Weiwen Jiang, Jiexiong Guan, Caiwen Ding, Pu Zhao, Sijia Liu, Bin Ren, Yanzhi Wang

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science 2021-03-18 Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan