Related papers: Progressive Inference: Explaining Decoder-Only Seq…

200 papers

We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a…

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. What if we were not compute-limited, should we still use decoder-only…

Machine Learning · Computer Science 2025-02-05 Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem - and in doing…

Machine Learning · Computer Science 2026-05-18 Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence and then generate an output sequence based on the input encoding. Both are interfaced with an attention…

Computation and Language · Computer Science 2018-11-02 Maha Elbayad, Laurent Besacier, Jakob Verbeek

In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called…

Methodology · Statistics 2024-11-04 Feng Gan, Wanfeng Liang, Changliang Zou

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local…

Artificial Intelligence · Computer Science 2026-05-15 Erica Stutz, Giacomo Marino, Daniella Meeker, Qiao Liu, Andrew J. Loza

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-28 Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim

In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer-models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a…

Computation and Language · Computer Science 2024-04-04 Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, Jaap Jumelet

Inference models are a key component in scaling variational inference to deep latent variable models, most notably as encoder networks in variational auto-encoders (VAEs). By replacing conventional optimization-based inference with a…

Machine Learning · Computer Science 2018-07-26 Joseph Marino, Yisong Yue, Stephan Mandt

There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our…

Machine Learning · Computer Science 2015-06-08 Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass.…

Computation and Language · Computer Science 2025-10-27 Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou

Set-based transformer models for amortized probabilistic inference and meta-learning, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many applications…

Recently, the attention-enhanced multi-layer encoder, such as Transformer, has been extensively studied in Machine Reading Comprehension (MRC). To predict the answer, it is common practice to employ a predictor to draw information only from…

Computation and Language · Computer Science 2021-02-03 Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, Yuexian Zou

This paper describes how to specify probability models for data analysis via a backward induction procedure. The new approach yields coherent, prior-free uncertainty assessment. After presenting some intuition-building examples, the new…

Methodology · Statistics 2015-02-24 P. Richard Hahn

This paper presents a novel approach to improve the Model Predictive Path Integral (MPPI) control by using a transformer to initialize the mean control sequence. Traditional MPPI methods often struggle with sample efficiency and…

Robotics · Computer Science 2024-12-24 Shrenik Zinage, Vrushabh Zinage, Efstathios Bakolas

We show that the way inference is performed in few-shot segmentation tasks has a substantial effect on performance -- an aspect often overlooked in the literature in favor of the meta-learning paradigm. We introduce a transductive…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, Jose Dolz

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically…

Machine Learning · Computer Science 2026-03-13 Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

In this work, we unify several existing decoding strategies for punctuation prediction in one framework and introduce a novel strategy which utilises multiple predictions at each word across different windows. We show that significant…

Computation and Language · Computer Science 2021-12-20 Christoph Minixhofer, Ondřej Klejch, Peter Bell

Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting…

Machine Learning · Computer Science 2022-02-24 Dazhao Du, Bing Su, Zhewei Wei