Related papers: Learning distributed representations with efficien…

Softmax Optimizations for Intel Xeon Processor-based Platforms

Softmax is a popular normalization method used in machine learning. Deep learning solutions like Transformer or BERT use the softmax function intensively, so it is worthwhile to optimize its performance. This article presents our methodology…

Mathematical Software · Computer Science 2019-05-28 Jacek Czaja, Michal Gallus, Tomasz Patejko, Jian Tang

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

The Softmax function on top of a final linear layer is the de facto method to output probability distributions in neural networks. In many applications such as language models or text generation, this model has to produce distributions over…

Machine Learning · Computer Science 2019-05-15 Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul, Aliaksei Severyn

Improving Optimization for Models With Continuous Symmetry Breaking

Many loss functions in representation learning are invariant under a continuous symmetry transformation. For example, the loss function of word embeddings (Mikolov et al., 2013) remains unchanged if we simultaneously rotate all word and…

Machine Learning · Statistics 2020-07-21 Robert Bamler, Stephan Mandt

Sum Estimation via Vector Similarity Search

Semantic embeddings that represent objects such as images, text, and audio are widely used in machine learning and have spurred the development of vector similarity search methods for retrieving semantically related objects. In this work, we…

Data Structures and Algorithms · Computer Science 2026-01-21 Stephen Mussmann, Mehul Smriti Raje, Kavya Tumkur, Oumayma Messoussi, Cyprien Hachem, Seby Jacob

One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

The softmax representation of probabilities for categorical variables plays a prominent role in modern machine learning with numerous applications in areas such as large scale classification, neural language modeling and recommendation…

Machine Learning · Statistics 2016-11-01 Michalis K. Titsias

Efficient Image Representation Learning with Federated Sampled Softmax

Learning image representations on decentralized data can bring many benefits in cases where data cannot be aggregated across data silos. Softmax cross entropy loss is highly effective and commonly used for learning image representations.…

Machine Learning · Computer Science 2022-03-10 Sagar M. Waghmare, Hang Qi, Huizhong Chen, Mikhail Sirotenko, Tomer Meron

Inducing and Embedding Senses with Scaled Gumbel Softmax

Methods for learning word sense embeddings represent a single word with multiple sense-specific vectors. These methods should not only produce interpretable sense embeddings, but should also learn how to select which sense to use in a given…

Computation and Language · Computer Science 2019-12-17 Fenfei Guo, Mohit Iyyer, Jordan Boyd-Graber

Joint Discriminative and Metric Embedding Learning for Person Re-Identification

Person re-identification is a challenging task because of the high intra-class variance induced by the unrestricted nuisance factors of variations such as pose, illumination, viewpoint, background, and sensor noise. Recent approaches…

Computer Vision and Pattern Recognition · Computer Science 2023-01-02 Sinan Sabri, Zaigham Randhawa, Gianfranco Doretto

Efficient Learning for Undirected Topic Models

The Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel…

Machine Learning · Computer Science 2015-06-25 Jiatao Gu, Victor O. K. Li

Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach

In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The…

Machine Learning · Computer Science 2025-09-17 Jiyong Ma

SSN: Learning Sparse Switchable Normalization via SparsestMax

Normalization methods improve both optimization and generalization of ConvNets. To further boost performance, the recently-proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different…

Computer Vision and Pattern Recognition · Computer Science 2019-03-12 Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, Ping Luo

Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Softmax is widely used in neural networks for multiclass classification, gate structures, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, when used…

Computer Vision and Pattern Recognition · Computer Science 2021-08-17 Shulun Wang, Bin Liu, Feng Liu

Revisiting lp-constrained Softmax Loss: A Comprehensive Study

Normalization is a vital process for any machine learning task as it controls the properties of data and affects model performance at large. The impact of particular forms of normalization, however, has so far been investigated in limited…

Machine Learning · Computer Science 2022-06-22 Chintan Trivedi, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis

SoftTriple Loss: Deep Metric Learning Without Triplet Sampling

Distance metric learning (DML) aims to learn embeddings in which examples from the same class are closer than examples from different classes. It can be cast as an optimization problem with triplet constraints. Due to the vast number of…

Computer Vision and Pattern Recognition · Computer Science 2020-04-16 Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, Rong Jin

Deep Learning using Linear Support Vector Machines

Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and…

Machine Learning · Computer Science 2015-02-24 Yichuan Tang

Word Embedding based on Low-Rank Doubly Stochastic Matrix Decomposition

Word embedding, which encodes words into vectors, is an important starting point in natural language processing and commonly used in many text-based machine learning tasks. However, in most current word embedding approaches, the similarity…

Computation and Language · Computer Science 2018-12-27 Denis Sedov, Zhirong Yang

Softmax Dissection: Towards Understanding Intra- and Inter-class Objective for Embedding Learning

The softmax loss and its variants are widely used as objectives for embedding learning, especially in applications like face recognition. However, the intra- and inter-class objectives in the softmax loss are entangled, therefore a…

Computer Vision and Pattern Recognition · Computer Science 2020-02-13 Lanqing He, Zhongdao Wang, Yali Li, Shengjin Wang

Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models

Many applications of generative models rely on the marginalization of their high-dimensional output probability distributions. Normalization functions that yield sparse probability distributions can make exact marginalization more…

Machine Learning · Computer Science 2021-10-28 Phil Chen, Masha Itkina, Ransalu Senanayake, Mykel J. Kochenderfer

SVMax: A Feature Embedding Regularizer

A neural network regularizer (e.g., weight decay) boosts performance by explicitly penalizing the complexity of a network. In this paper, we penalize inferior network activations -- feature embeddings -- which in turn regularize the…

Computer Vision and Pattern Recognition · Computer Science 2021-03-05 Ahmed Taha, Alex Hanson, Abhinav Shrivastava, Larry Davis

Estimation of embedding vectors in high dimensions

Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to…

Machine Learning · Computer Science 2025-04-10 Golara Ahmadi Azar, Melika Emami, Alyson Fletcher, Sundeep Rangan