Related papers: Binary Hypothesis Testing for Softmax Models and L…

Attention Scheme Inspired Softmax Regression

Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible…

Machine Learning · Computer Science 2023-04-27 Yichuan Deng, Zhihang Li, Zhao Song

Leveraging Uncertainty Estimates To Improve Classifier Performance

Binary classification involves predicting the label of an instance based on whether the model score for the positive class exceeds a threshold chosen based on the application requirements (e.g., maximizing recall for a precision bound).…

Machine Learning · Computer Science 2023-11-21 Gundeep Arora, Srujana Merugu, Anoop Saladi, Rajeev Rastogi

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Density-Softmax: Efficient Test-time Model for Uncertainty Estimation and Robustness under Distribution Shifts

Sampling-based methods, e.g., Deep Ensembles and Bayesian Neural Nets, have become promising approaches to improve the quality of uncertainty estimation and robust generalization. However, they suffer from a large model size and high latency…

Machine Learning · Computer Science 2024-05-29 Ha Manh Bui, Anqi Liu

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Adaptive Sampled Softmax with Kernel Based Sampling

Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it…

Machine Learning · Computer Science 2018-08-03 Guy Blanc, Steffen Rendle

Robust Multi-Hypothesis Testing with Moment Constrained Uncertainty Sets

The problem of robust binary hypothesis testing is studied. Under both hypotheses, the data-generating distributions are assumed to belong to uncertainty sets constructed through moments; in particular, the sets contain distributions whose…

Statistics Theory · Mathematics 2024-01-09 Akshayaa Magesh, Zhongchang Sun, Venugopal V. Veeravalli, Shaofeng Zou

One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

The softmax representation of probabilities for categorical variables plays a prominent role in modern machine learning with numerous applications in areas such as large scale classification, neural language modeling and recommendation…

Machine Learning · Statistics 2016-11-01 Michalis K. Titsias

A binary-response regression model based on support vector machines

The soft-margin support vector machine (SVM) is a ubiquitous tool for prediction of binary-response data. However, the SVM is characterized entirely via a numerical optimization problem, rather than a probability model, and thus does not…

Methodology · Statistics 2020-07-24 Hien D Nguyen, Daniel V Fryer

Image Score: How to Select Useful Samples

There have long been debates on how we could interpret neural networks and understand the decisions our models make, specifically why deep neural networks tend to be error-prone when dealing with samples that output low softmax scores. We…

Computer Vision and Pattern Recognition · Computer Science 2018-12-04 Simiao Zuo, Jialin Wu

Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation

Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to…

Computation and Language · Computer Science 2016-08-08 Pranava Swaroop Madhyastha, Cristina España-Bonet

Distribution-restrained Softmax Loss for the Model Robustness

Recently, the robustness of deep learning models has received widespread attention, and various methods for improving model robustness have been proposed, including adversarial training, model architecture modification, design of loss…

Machine Learning · Computer Science 2023-03-23 Hao Wang, Chen Li, Jinzhe Jiang, Xin Zhang, Yaqian Zhao, Weifeng Gong

In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick

Large language models (LLMs) have brought significant and transformative changes in human society. These models have demonstrated remarkable capabilities in natural language understanding and generation, leading to various advancements and…

Machine Learning · Computer Science 2023-07-06 Yeqi Gao, Zhao Song, Shenghao Xie

Large-Margin Classification with Multiple Decision Rules

Binary classification is a common statistical learning problem in which a model is estimated on a set of covariates for some outcome indicating the membership of one of two classes. In the literature, there exists a distinction between hard…

Machine Learning · Statistics 2014-11-20 Patrick K. Kimes, D. Neil Hayes, J. S. Marron, Yufeng Liu

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

The Softmax function on top of a final linear layer is the de facto method to output probability distributions in neural networks. In many applications such as language models or text generation, this model has to produce distributions over…

Machine Learning · Computer Science 2019-05-15 Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul, Aliaksei Severyn

Large Deviation Analysis of Score-based Hypothesis Testing

Score-based statistical models play an important role in modern machine learning, statistics, and signal processing. For hypothesis testing, a score-based hypothesis test is proposed in \cite{wu2022score}. We analyze the performance of this…

Signal Processing · Electrical Eng. & Systems 2024-02-06 Enmao Diao, Taposh Banerjee, Vahid Tarokh

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

While Large Language Models (LLMs) dominate the majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model…

Computation and Language · Computer Science 2024-02-07 Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka

Bridging the Divide: Reconsidering Softmax and Linear Attention

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang

Statistical Classification via Robust Hypothesis Testing: Non-Asymptotic and Simple Bounds

We consider the Bayesian multiple statistical classification problem in the case where the unknown source distributions are estimated from the labeled training sequences, then the estimates are used as nominal distributions in a robust…

Information Theory · Computer Science 2021-10-11 Hüseyin Afşer

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable…

Computation and Language · Computer Science 2024-07-10 Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang