Related papers: Efficient softmax approximation for GPUs

Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks

Neural language models have been widely used in various NLP tasks, including machine translation, next word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow…

Machine Learning · Computer Science · 2018-10-31 · Patrick H. Chen, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh

Self-organized Hierarchical Softmax

We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies. Instead of using a predefined hierarchical structure, our approach is capable of learning word clusters with…

Computation and Language · Computer Science · 2017-07-29 · Yikang Shen, Shawn Tan, Christopher Pal, Aaron Courville

Strategies for Training Large Vocabulary Neural Language Models

Training neural network language models over large vocabularies is still computationally very costly compared to count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many applications…

Computation and Language · Computer Science · 2015-12-16 · Wenlin Chen, David Grangier, Michael Auli

A Factorized Recurrent Neural Network based architecture for medium to large vocabulary Language Modelling

Statistical language models are central to many applications that use semantics. Recurrent Neural Networks (RNNs) are known to produce state-of-the-art results for language modelling, outperforming their traditional n-gram counterparts in…

Computation and Language · Computer Science · 2016-02-05 · Anantharaman Palacode Narayana Iyer

Real-time Neural-based Input Method

The input method is an essential service on every mobile and desktop device that provides text suggestions. It converts sequential keyboard inputs to the characters in its target language, which is indispensable for Japanese and Chinese…

Computation and Language · Computer Science · 2018-10-23 · Jiali Yao, Raphael Shu, Xinjian Li, Katsutoshi Ohtsuki, Hideki Nakayama

Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism

There has been a rapid advance of custom hardware (HW) for accelerating the inference speed of deep neural networks (DNNs). Previously, the softmax layer was not a main concern of DNN accelerating HW, because its portion is relatively small…

Machine Learning · Computer Science · 2021-11-23 · Ihor Vasyltsov, Wooseok Chang

Adaptive Input Representations for Neural Language Modeling

We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and…

Computation and Language · Computer Science · 2019-02-26 · Alexei Baevski, Michael Auli

Investigation of Large-Margin Softmax in Neural Language Modeling

To encourage intra-class compactness and inter-class separability among trainable feature vectors, large-margin softmax methods are developed and widely applied in the face recognition community. The introduction of the large-margin concept…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-04-22 · Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney

Attention Scheme Inspired Softmax Regression

Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible…

Machine Learning · Computer Science · 2023-04-27 · Yichuan Deng, Zhihang Li, Zhao Song

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more…

Computation and Language · Computer Science · 2018-06-20 · Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, Cho-Jui Hsieh

Efficient Contextual Representation Learning Without Softmax Layer

Contextual representation models have achieved great success in improving various downstream tasks. However, these language-model-based encoders are difficult to train due to the large parameter sizes and high computational complexity. By…

Computation and Language · Computer Science · 2019-03-01 · Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang

Optimized Speculative Sampling for GPU Hardware Accelerators

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently.…

Machine Learning · Computer Science · 2024-10-04 · Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science · 2021-03-18 · Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

MicroNet for Efficient Language Modeling

It is important to design compact language models for efficient deployment. We improve upon recent advances in both the language modeling domain and the model-compression domain to construct parameter and computation efficient language…

Computation and Language · Computer Science · 2020-05-19 · Zhongxia Yan, Hanrui Wang, Demi Guo, Song Han

Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding largely due to the…

Computation and Language · Computer Science · 2018-06-13 · Minjia Zhang, Xiaodong Liu, Wenhan Wang, Jianfeng Gao, Yuxiong He

Speeding Up Entmax

Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being…

Computation and Language · Computer Science · 2022-05-20 · Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs

The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute which limits the vocabulary size to a subset of most frequent…

Computation and Language · Computer Science · 2019-03-25 · Sachin Kumar, Yulia Tsvetkov

Smaller Text Classifiers with Discriminative Cluster Embeddings

Word embedding parameters often dominate overall model sizes in neural methods for natural language processing. We reduce deployed model sizes of text classifiers by learning a hard word clustering in an end-to-end manner. We use the…

Computation and Language · Computer Science · 2019-06-25 · Mingda Chen, Kevin Gimpel

Scalable-Softmax Is Superior for Attention

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to…

Computation and Language · Computer Science · 2025-02-03 · Ken M. Nakanishi

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model…

Machine Learning · Computer Science · 2025-02-27 · Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou