Related papers: The Two-Pass Softmax Algorithm

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for the multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Online normalizer calculation for softmax

The Softmax function is ubiquitous in machine learning, and multiple previous works have suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction…

Performance · Computer Science 2018-07-31 Maxim Milakov, Natalia Gimelshein

Softmax Optimizations for Intel Xeon Processor-based Platforms

Softmax is a popular normalization method used in machine learning. Deep learning solutions like Transformer or BERT use the softmax function intensively, so it is worthwhile to optimize its performance. This article presents our methodology…

Mathematical Software · Computer Science 2019-05-28 Jacek Czaja, Michal Gallus, Tomasz Patejko, Jian Tang

Accurate Computation of the Log-Sum-Exp and Softmax Functions

Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classification. Because of the exponentials that these functions contain, the evaluation is prone to…

Numerical Analysis · Mathematics 2019-09-10 Pierre Blanchard, Desmond J. Higham, Nicholas J. Higham

A Quantitative Evaluation of Approximate Softmax Functions for Deep Neural Networks

The softmax function is a widely used activation function in the output layers of neural networks, responsible for converting raw scores into class probabilities while introducing essential non-linearity. Implementing Softmax efficiently…

Hardware Architecture · Computer Science 2026-04-09 Anthony Leiva-Valverde, Fabricio Elizondo-Fernández, Luis G. León-Vega, Cristina Meinhardt, Jorge Castro-Godínez

MultiMax: Sparse and Multi-Modal Attention Learning

SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to…

Machine Learning · Computer Science 2025-01-09 Yuxuan Zhou, Mario Fritz, Margret Keuper

BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling

As the performance gains from accelerating quantized matrix multiplication plateau, the softmax operation becomes the critical bottleneck in Transformer inference. This bottleneck stems from two hardware limitations: (1) limited data…

Machine Learning · Computer Science 2026-02-03 Zisheng Ye, Xiaoyu He, Maoyuan Song, Guoliang Qiu, Chao Liao, Chen Wu, Yonggang Sun, Zhichun Li, Xiaoru Xie, Yuanyong Luo, Hu Liu, Pinyan Lu, Heng Liao

TAPAS: Two-pass Approximate Adaptive Sampling for Softmax

TAPAS is a novel adaptive sampling method for the softmax model. It uses a two-pass sampling strategy where the examples used to approximate the gradient of the partition function are first sampled according to a squashed population…

Machine Learning · Computer Science 2017-07-17 Yu Bai, Sally Goldman, Li Zhang

Effectiveness of MPC-friendly Softmax Replacement

Softmax is widely used in deep learning to map some representation to a probability distribution. As it is based on exp/log functions that are relatively expensive in multi-party computation, Mohassel and Zhang (2017) proposed a simpler…

Machine Learning · Computer Science 2021-07-07 Marcel Keller, Ke Sun

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science 2021-03-18 Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

The softmax representation of probabilities for categorical variables plays a prominent role in modern machine learning with numerous applications in areas such as large scale classification, neural language modeling and recommendation…

Machine Learning · Statistics 2016-11-01 Michalis K. Titsias

Money on the Table: Statistical information ignored by Softmax can improve classifier accuracy

Softmax is a standard final layer used in Neural Nets (NNs) to summarize information encoded in the trained NN and return a prediction. However, Softmax leverages only a subset of the class-specific structure encoded in the trained model…

Machine Learning · Computer Science 2019-12-09 Charles B. Delahunt, Courosh Mehanian, J. Nathan Kutz

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of representational capacity of neural…

Machine Learning · Statistics 2018-05-29 Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi

Speeding Up Entmax

Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution, each token in the vocabulary has a nonzero chance of being…

Computation and Language · Computer Science 2022-05-20 Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Self-Adjust Softmax

The softmax function is crucial in Transformer attention, which normalizes each row of the attention scores with summation to one, achieving superior performances over other alternative functions. However, the softmax function can face a…

Computation and Language · Computer Science 2025-02-26 Chuanyang Zheng, Yihang Gao, Guoxuan Chen, Han Shi, Jing Xiong, Xiaozhe Ren, Chao Huang, Xin Jiang, Zhenguo Li, Yu Li

SSN: Learning Sparse Switchable Normalization via SparsestMax

Normalization methods improve both optimization and generalization of ConvNets. To further boost performance, the recently-proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different…

Computer Vision and Pattern Recognition · Computer Science 2019-03-12 Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, Ping Luo

Revisiting Softmax for Uncertainty Approximation in Text Classification

Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. One of the most widely used uncertainty approximation methods is Monte Carlo (MC) Dropout, which is…

Machine Learning · Computer Science 2023-07-20 Andreas Nugaard Holm, Dustin Wright, Isabelle Augenstein

r-softmax: Generalized Softmax with Controllable Sparsity Rate

Nowadays artificial neural network models achieve remarkable results in many disciplines. Functions mapping the representation provided by the model to the probability distribution are an inseparable aspect of deep learning solutions.…

Machine Learning · Computer Science 2023-04-24 Klaudia Bałazy, Łukasz Struski, Marek Śmieja, Jacek Tabor

Revisiting lp-constrained Softmax Loss: A Comprehensive Study

Normalization is a vital process for any machine learning task as it controls the properties of data and affects model performance at large. The impact of particular forms of normalization, however, has so far been investigated in limited…

Machine Learning · Computer Science 2022-06-22 Chintan Trivedi, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis

An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

In a multi-class classification problem, it is standard to model the output of a neural network as a categorical distribution conditioned on the inputs. The output must therefore be positive and sum to one, which is traditionally enforced…

Neural and Evolutionary Computing · Computer Science 2016-03-01 Alexandre de Brébisson, Pascal Vincent