Related papers: Sparse Sequence-to-Sequence Models

Long-Context Generalization with Sparse Attention

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for…

Computation and Language · Computer Science 2026-03-03 Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs

The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute, which limits the vocabulary size to a subset of most frequent…

Computation and Language · Computer Science 2019-03-25 Sachin Kumar, Yulia Tsvetkov

Sparse and Continuous Attention Mechanisms

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions…

Machine Learning · Computer Science 2020-10-30 André F. T. Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro M. Q. Aguiar, Mário A. T. Figueiredo

Smoothing and Shrinking the Sparse Seq2Seq Search Space

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one…

Computation and Language · Computer Science 2021-03-19 Ben Peters, André F. T. Martins

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Approximate Distribution Matching for Sequence-to-Sequence Learning

Sequence-to-Sequence models were introduced to tackle many real-life problems like machine translation, summarization, image captioning, etc. The standard optimization algorithms are mainly based on example-to-example matching like maximum…

Computation and Language · Computer Science 2018-09-05 Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, Ming Zhou

Simplifying Sentences with Sequence to Sequence Models

We simplify sentences with an attentive neural network sequence to sequence model, dubbed S4. The model includes a novel word-copy mechanism and loss function to exploit linguistic similarities between the original and simplified sentences.…

Computation and Language · Computer Science 2018-05-16 Alexander Mathews, Lexing Xie, Xuming He

Sparse Meta Networks for Sequential Adaptation and its Application to Adaptive Language Modelling

Training a deep neural network requires a large amount of single-task data and involves a long time-consuming optimization phase. This is not scalable to complex, realistic environments with new unexpected changes. Humans can perform fast…

Neural and Evolutionary Computing · Computer Science 2020-09-04 Tsendsuren Munkhdalai

Sequence-to-Sequence Learning as Beam-Search Optimization

Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits…

Computation and Language · Computer Science 2016-11-11 Sam Wiseman, Alexander M. Rush

The Contextual Lasso: Sparse Linear Models via Deep Neural Networks

Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible…

Machine Learning · Statistics 2024-01-03 Ryan Thompson, Amir Dezfouli, Robert Kohn

Sparse Text Generation

Current state-of-the-art text generators build on powerful language models such as GPT-2, achieving impressive performance. However, to avoid degenerate text, they require sampling from a modified softmax, via temperature parameters or…

Computation and Language · Computer Science 2020-10-06 Pedro Henrique Martins, Zita Marinho, André F. T. Martins

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has…

Machine Learning · Computer Science 2020-12-22 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

A Regularized Framework for Sparse and Structured Neural Attention

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max…

Machine Learning · Statistics 2019-02-26 Vlad Niculae, Mathieu Blondel

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention…

Computation and Language · Computer Science 2026-04-10 Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang

MultiMax: Sparse and Multi-Modal Attention Learning

SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to…

Machine Learning · Computer Science 2025-01-09 Yuxuan Zhou, Mario Fritz, Margret Keuper

Generating Long Sequences with Sparse Transformers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Learning Soft Sparse Shapes for Efficient Time-Series Classification

Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative…

Machine Learning · Computer Science 2025-06-04 Zhen Liu, Yicheng Luo, Boyuan Li, Emadeldeen Eldele, Min Wu, Qianli Ma

Predicting Attention Sparsity in Transformers

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact…

Computation and Language · Computer Science 2022-04-22 Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

On Controllable Sparse Alternatives to Softmax

Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms, etc. For this,…

Machine Learning · Computer Science 2018-11-01 Anirban Laha, Saneem A. Chemmengath, Priyanka Agrawal, Mitesh M. Khapra, Karthik Sankaranarayanan, Harish G. Ramaswamy