Related papers: Progressive Inference: Explaining Decoder-Only Seq…

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-17 · Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya L. Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

ENTP: Encoder-only Next Token Prediction

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. What if we were not compute-limited, should we still use decoder-only…

Machine Learning · Computer Science · 2025-02-05 · Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

Transformer-like Inference from Optimal Control

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem - and in doing…

Machine Learning · Computer Science · 2026-05-18 · Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention…

Computation and Language · Computer Science · 2018-11-02 · Maha Elbayad, Laurent Besacier, Jakob Verbeek

Prediction De-Correlated Inference: A safe approach for post-prediction inference

In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called…

Methodology · Statistics · 2024-11-04 · Feng Gan, Wanfeng Liang, Changliang Zou

Conditional Attribute Estimation with Autoregressive Sequence Models

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local…

Artificial Intelligence · Computer Science · 2026-05-15 · Erica Stutz, Giacomo Marino, Daniella Meeker, Qiao Liu, Andrew J. Loza

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-28 · Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

Object Recognition as Next Token Prediction

We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-02 · Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim

DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers

In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer-models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a…

Computation and Language · Computer Science · 2024-04-04 · Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, Jaap Jumelet

Iterative Amortized Inference

Inference models are a key component in scaling variational inference to deep latent variable models, most notably as encoder networks in variational auto-encoders (VAEs). By replacing conventional optimization-based inference with a…

Machine Learning · Computer Science · 2018-07-26 · Joseph Marino, Yisong Yue, Stephan Mandt

MADE: Masked Autoencoder for Distribution Estimation

There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our…

Machine Learning · Computer Science · 2015-06-08 · Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle

DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass.…

Computation and Language · Computer Science · 2025-10-27 · Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou

Efficient Autoregressive Inference for Transformer Probabilistic Models

Set-based transformer models for amortized probabilistic inference and meta-learning, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many applications…

Machine Learning · Statistics · 2026-04-22 · Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi

Adaptive Bi-directional Attention: Exploring Multi-Granularity Representations for Machine Reading Comprehension

Recently, the attention-enhanced multi-layer encoder, such as Transformer, has been extensively studied in Machine Reading Comprehension (MRC). To predict the answer, it is common practice to employ a predictor to draw information only from…

Computation and Language · Computer Science · 2021-02-03 · Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, Yuexian Zou

Model specification via sequential coherence and backward induction

This paper describes how to specify probability models for data analysis via a backward induction procedure. The new approach yields coherent, prior-free uncertainty assessment. After presenting some intuition-building examples, the new…

Methodology · Statistics · 2015-02-24 · P. Richard Hahn

Transformer-Based Model Predictive Path Integral Control

This paper presents a novel approach to improve the Model Predictive Path Integral (MPPI) control by using a transformer to initialize the mean control sequence. Traditional MPPI methods often struggle with sample efficiency and…

Robotics · Computer Science · 2024-12-24 · Shrenik Zinage, Vrushabh Zinage, Efstathios Bakolas

Few-Shot Segmentation Without Meta-Learning: A Good Transductive Inference Is All You Need?

We show that the way inference is performed in few-shot segmentation tasks has a substantial effect on performances -- an aspect often overlooked in the literature in favor of the meta-learning paradigm. We introduce a transductive…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-31 · Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, Jose Dolz

Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically…

Machine Learning · Computer Science · 2026-03-13 · Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Mask-combine Decoding and Classification Approach for Punctuation Prediction with real-time Inference Constraints

In this work, we unify several existing decoding strategies for punctuation prediction in one framework and introduce a novel strategy which utilises multiple predictions at each word across different windows. We show that significant…

Computation and Language · Computer Science · 2021-12-20 · Christoph Minixhofer, Ondřej Klejch, Peter Bell

Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting

Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting…

Machine Learning · Computer Science · 2022-02-24 · Dazhao Du, Bing Su, Zhewei Wei