Related papers: Sparse Sequence-to-Sequence Models

200 papers

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for…

Computation and Language · Computer Science 2026-03-03 Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute which limits the vocabulary size to a subset of most frequent…

Computation and Language · Computer Science 2019-03-25 Sachin Kumar, Yulia Tsvetkov

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions…

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one…

Computation and Language · Computer Science 2021-03-19 Ben Peters, André F. T. Martins

The softmax function is widely used in artificial neural networks for the multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Sequence-to-Sequence models were introduced to tackle many real-life problems like machine translation, summarization, image captioning, etc. The standard optimization algorithms are mainly based on example-to-example matching like maximum…

Computation and Language · Computer Science 2018-09-05 Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, Ming Zhou

We simplify sentences with an attentive neural network sequence to sequence model, dubbed S4. The model includes a novel word-copy mechanism and loss function to exploit linguistic similarities between the original and simplified sentences.…

Computation and Language · Computer Science 2018-05-16 Alexander Mathews, Lexing Xie, Xuming He

Training a deep neural network requires a large amount of single-task data and involves a long time-consuming optimization phase. This is not scalable to complex, realistic environments with new unexpected changes. Humans can perform fast…

Neural and Evolutionary Computing · Computer Science 2020-09-04 Tsendsuren Munkhdalai

Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits…

Computation and Language · Computer Science 2016-11-11 Sam Wiseman, Alexander M. Rush

Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible…

Machine Learning · Statistics 2024-01-03 Ryan Thompson, Amir Dezfouli, Robert Kohn

Current state-of-the-art text generators build on powerful language models such as GPT-2, achieving impressive performance. However, to avoid degenerate text, they require sampling from a modified softmax, via temperature parameters or…

Computation and Language · Computer Science 2020-10-06 Pedro Henrique Martins, Zita Marinho, André F. T. Martins

Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has…

Machine Learning · Computer Science 2020-12-22 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max…

Machine Learning · Statistics 2019-02-26 Vlad Niculae, Mathieu Blondel

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention…

Computation and Language · Computer Science 2026-04-10 Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang

SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to…

Machine Learning · Computer Science 2025-01-09 Yuxuan Zhou, Mario Fritz, Margret Keuper

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative…

Machine Learning · Computer Science 2025-06-04 Zhen Liu, Yicheng Luo, Boyuan Li, Emadeldeen Eldele, Min Wu, Qianli Ma

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact…

Computation and Language · Computer Science 2022-04-22 Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms etc. For this,…
