Related papers: Softmax Optimizations for Intel Xeon Processor-bas…

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science · 2021-03-18 · Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

Online normalizer calculation for softmax

The Softmax function is ubiquitous in machine learning, and multiple previous works have suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction…

Performance · Computer Science · 2018-07-31 · Maxim Milakov, Natalia Gimelshein

A Quantitative Evaluation of Approximate Softmax Functions for Deep Neural Networks

The softmax function is a widely used activation function in the output layers of neural networks, responsible for converting raw scores into class probabilities while introducing essential non-linearity. Implementing Softmax efficiently…

Hardware Architecture · Computer Science · 2026-04-09 · Anthony Leiva-Valverde, Fabricio Elizondo-Fernández, Luis G. León-Vega, Cristina Meinhardt, Jorge Castro-Godínez

Learning distributed representations with efficient SoftMax normalization

Learning distributed representations, or embeddings, that encode the relational similarity patterns among objects is a relevant task in machine learning. A popular method to learn the embedding matrices $X, Y$ is optimizing a loss function…

Machine Learning · Computer Science · 2025-06-03 · Lorenzo Dall'Amico, Enrico Maria Belliardo

Speeding Up Entmax

Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being…

Computation and Language · Computer Science · 2022-05-20 · Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for the multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science · 2021-12-24 · Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Effectiveness of MPC-friendly Softmax Replacement

Softmax is widely used in deep learning to map some representation to a probability distribution. As it is based on exp/log functions that are relatively expensive in multi-party computation, Mohassel and Zhang (2017) proposed a simpler…

Machine Learning · Computer Science · 2021-07-07 · Marcel Keller, Ke Sun

Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism

There has been a rapid advance of custom hardware (HW) for accelerating the inference speed of deep neural networks (DNNs). Previously, the softmax layer was not a main concern of DNN accelerating HW, because its portion is relatively small…

Machine Learning · Computer Science · 2021-11-23 · Ihor Vasyltsov, Wooseok Chang

Adaptive Sampled Softmax with Kernel Based Sampling

Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it…

Machine Learning · Computer Science · 2018-08-03 · Guy Blanc, Steffen Rendle

Efficient Supernet Training with Orthogonal Softmax for Scalable ASR Model Compression

ASR systems are deployed across diverse environments, each with specific hardware constraints. We use supernet training to jointly train multiple encoders of varying sizes, enabling dynamic model size adjustment to fit hardware constraints…

Computation and Language · Computer Science · 2025-02-05 · Jingjing Xu, Eugen Beck, Zijian Yang, Ralf Schlüter

Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness

The softmax function is a fundamental component in deep learning. This study delves into the often-overlooked parameter within the softmax function, known as "temperature," providing novel insights into the practical and theoretical aspects…

Machine Learning · Computer Science · 2025-03-03 · Hao Xuan, Bokai Yang, Xingyu Li

SSN: Learning Sparse Switchable Normalization via SparsestMax

Normalization methods improve both optimization and generalization of ConvNets. To further boost performance, the recently-proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different…

Computer Vision and Pattern Recognition · Computer Science · 2019-03-12 · Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, Ping Luo

Softmax Dissection: Towards Understanding Intra- and Inter-class Objective for Embedding Learning

The softmax loss and its variants are widely used as objectives for embedding learning, especially in applications like face recognition. However, the intra- and inter-class objectives in the softmax loss are entangled, therefore a…

Computer Vision and Pattern Recognition · Computer Science · 2020-02-13 · Lanqing He, Zhongdao Wang, Yali Li, Shengjin Wang

Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond

The softmax activation function plays a crucial role in the success of large language models (LLMs), particularly in the self-attention mechanism of the widely adopted Transformer architecture. However, the underlying learning dynamics that…

Machine Learning · Computer Science · 2026-01-27 · Yang Cao, Yingyu Liang, Zhenmei Shi, Zhao Song

An Alternative Softmax Operator for Reinforcement Learning

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to…

Artificial Intelligence · Computer Science · 2017-06-15 · Kavosh Asadi, Michael L. Littman

Efficient softmax approximation for GPUs

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the…

Computation and Language · Computer Science · 2017-06-20 · Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

SoftNeuro: Fast Deep Inference using Multi-platform Optimization

Faster inference of deep learning models is highly demanded on edge devices and even servers, for both financial and environmental reasons. To address this issue, we propose SoftNeuro, a novel, high-performance inference framework with…

Machine Learning · Computer Science · 2021-10-13 · Masaki Hilaga, Yasuhiro Kuroda, Hitoshi Matsuo, Tatsuya Kawaguchi, Gabriel Ogawa, Hiroshi Miyake, Yusuke Kozawa

The Two-Pass Softmax Algorithm

The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in…

Performance · Computer Science · 2020-01-14 · Marat Dukhan, Artsiom Ablavatski

Efficient Sampled Softmax for Tensorflow

This short paper discusses an efficient implementation of sampled softmax loss for Tensorflow. The speedup over the default implementation is achieved due to simplification of the graph for the forward and backward passes.

Machine Learning · Computer Science · 2020-04-14 · Maciej Skorski

Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization

The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-04 · Firas Khader, Omar S. M. El Nahhas, Tianyu Han, Gustav Müller-Franzes, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn