Related papers: Efficient Sampled Softmax for Tensorflow

Softmax Optimizations for Intel Xeon Processor-based Platforms

Softmax is popular normalization method used in machine learning. Deep learning solutions like Transformer or BERT use the softmax function intensively, so it is worthwhile to optimize its performance. This article presents our methodology…

Mathematical Software · Computer Science 2019-05-28 Jacek Czaja , Michal Gallus , Tomasz Patejko , Jian Tang

Accurate Computation of the Log-Sum-Exp and Softmax Functions

Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classification. Because of the exponentials that these functions contain, the evaluation is prone to…

Numerical Analysis · Mathematics 2019-09-10 Pierre Blanchard , Desmond J. Higham , Nicholas J. Higham

Speeding Up Entmax

Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being…

Computation and Language · Computer Science 2022-05-20 Maxat Tezekbayev , Vassilina Nikoulina , Matthias Gallé , Zhenisbek Assylbekov

Efficient Image Representation Learning with Federated Sampled Softmax

Learning image representations on decentralized data can bring many benefits in cases where data cannot be aggregated across data silos. Softmax cross entropy loss is highly effective and commonly used for learning image representations.…

Machine Learning · Computer Science 2022-03-10 Sagar M. Waghmare , Hang Qi , Huizhong Chen , Mikhail Sirotenko , Tomer Meron

Sampled Softmax with Random Fourier Features

The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of…

Machine Learning · Computer Science 2020-01-01 Ankit Singh Rawat , Jiecao Chen , Felix Yu , Ananda Theertha Suresh , Sanjiv Kumar

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science 2021-03-18 Jacob R. Stevens , Rangharajan Venkatesan , Steve Dai , Brucek Khailany , Anand Raghunathan

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for the multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun , Zhenyuan Zhang , BoCheng Huang , Pengbin Lei , Jianlin Su , Shengfeng Pan , Jiarun Cao

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

RNN-Transducer has been one of promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages including its strong accuracy and streaming-friendly property, its high memory consumption…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Jaesong Lee , Lukas Lee , Shinji Watanabe

A Tale of Two Efficient and Informative Negative Sampling Distributions

Softmax classifiers with a very large number of classes naturally occur in many applications such as natural language processing and information retrieval. The calculation of full softmax is costly from the computational and energy…

Machine Learning · Computer Science 2021-07-30 Shabnam Daghaghi , Tharun Medini , Nicholas Meisburger , Beidi Chen , Mengnan Zhao , Anshumali Shrivastava

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…

Hardware Architecture · Computer Science 2024-07-12 Mohammed Elbtity , Peyton Chandarana , Ramtin Zand

Data-Efficient Training by Evolved Sampling

Data selection is designed to accelerate learning with preserved performance. To achieve this, a fundamental thought is to identify informative data samples with significant contributions to the training. In this work, we propose…

Machine Learning · Computer Science 2025-09-30 Ziheng Cheng , Zhong Li , Jiang Bian

Online normalizer calculation for softmax

The Softmax function is ubiquitous in machine learning, multiple previous works suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction…

Performance · Computer Science 2018-07-31 Maxim Milakov , Natalia Gimelshein

XES Tensorflow - Process Prediction using the Tensorflow Deep-Learning Framework

Predicting the next activity of a running process is an important aspect of process management. Recently, artificial neural networks, so called deep-learning approaches, have been proposed to address this challenge. This demo paper…

Machine Learning · Computer Science 2017-05-04 Joerg Evermann , Jana-Rebecca Rehse , Peter Fettke

Transparent FPGA Acceleration with TensorFlow

Today, artificial neural networks are one of the major innovators pushing the progress of machine learning. This has particularly affected the development of neural network accelerating hardware. However, since most of these architectures…

Hardware Architecture · Computer Science 2021-02-12 Simon Pfenning , Philipp Holzinger , Marc Reichenbach

TensorFlow Doing HPC

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML)…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-03 Steven W. D. Chien , Stefano Markidis , Vyacheslav Olshevsky , Yaroslav Bulatov , Erwin Laure , Jeffrey S. Vetter

Efficient softmax approximation for GPUs

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the…

Computation and Language · Computer Science 2017-06-20 Edouard Grave , Armand Joulin , Moustapha Cissé , David Grangier , Hervé Jégou

Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications

The softmax function is a cornerstone of multi-class classification, integral to a wide range of machine learning applications, from large-scale retrieval and ranking models to advanced large language models. However, its computational cost…

Machine Learning · Computer Science 2025-01-16 Jin Chen , Jin Zhang , Xu huang , Yi Yang , Defu Lian , Enhong Chen

Swift for TensorFlow: A portable, flexible platform for deep learning

Swift for TensorFlow is a deep learning platform that scales from mobile devices to clusters of hardware accelerators in data centers. It combines a language-integrated automatic differentiation system and multiple Tensor implementations…

Machine Learning · Computer Science 2021-03-01 Brennan Saeta , Denys Shabalin , Marc Rasi , Brad Larson , Xihui Wu , Parker Schuh , Michelle Casbon , Daniel Zheng , Saleem Abdulrasool , Aleksandr Efremov , Dave Abrahams , Chris Lattner , Richard Wei

TensorTrace: an application to contract tensor networks

Tensor network methods are a conceptually elegant framework for encoding complicated datasets, where high-order tensors are approximated as networks of low-order tensors. In practice, however, the numeric implementation of tensor network…

Quantum Physics · Physics 2019-11-07 Glen Evenbly

TensorFlow: A system for large-scale machine learning

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-01 Martín Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek G. Murray , Benoit Steiner , Paul Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , Xiaoqiang Zheng