Related papers: Conditional Attribute Estimation with Autoregressi…

Towards Understanding the Universality of Transformers for Next-Token Prediction

Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this…

Machine Learning · Statistics 2025-03-04 Michael E. Sander, Gabriel Peyré

How do Transformers perform In-Context Autoregressive Learning?

Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a…

Machine Learning · Statistics 2024-06-06 Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models

Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive…

Machine Learning · Computer Science 2017-12-25 Jesse Engel, Matthew Hoffman, Adam Roberts

Parallel Token Prediction for Language Models

Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP…

Computation and Language · Computer Science 2026-03-06 Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

Arbitrary Ratio Feature Compression via Next Token Prediction

Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Yufan Liu, Daoyuan Ren, Zhipeng Zhang, Wenyang Luo, Bing Li, Weiming Hu, Stephen Maybank

Next-token prediction capacity: general upper bounds and a lower bound for transformers

Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are…

Machine Learning · Computer Science 2025-11-25 Liam Madden, Curtis Fox, Christos Thrampoulidis

Mechanics of Next Token Prediction with Self-Attention

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying…

Machine Learning · Computer Science 2024-03-14 Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a…

Machine Learning · Computer Science 2024-12-11 Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

Enhancing next token prediction based pre-training for jet foundation models

Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next…

High Energy Physics - Phenomenology · Physics 2025-12-05 Joschka Birk, Anna Hallin, Gregor Kasieczka, Nikol Madzharova, Ian Pang, David Shih

Thinking into the Future: Latent Lookahead Training for Transformers

Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or…

Computation and Language · Computer Science 2026-03-24 Lorenzo Noci, Gregor Bachmann, Seyed-Mohsen Moosavi-Dezfooli, Moin Nabi

Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this…

Machine Learning · Computer Science 2026-03-17 Mark Rofin, Jalal Naghiyev, Michael Hahn

Probabilistic Decomposition Transformer for Time Series Forecasting

Time series forecasting is crucial for many fields, such as disaster warning, weather prediction, and energy consumption. The Transformer-based models are considered to have revolutionized the field of sequence modeling. However, the…

Machine Learning · Computer Science 2022-11-01 Junlong Tong, Liping Xie, Wankou Yang, Kanjian Zhang

Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition

We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. Motivated by recent work that predicts the probabilities of subsequent tokens using multiple heads, we…

Machine Learning · Computer Science 2025-02-11 Artem Basharin, Andrei Chertkov, Ivan Oseledets

Identifiable Token Correspondence for World Models

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A…

Machine Learning · Computer Science 2026-05-27 Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

Conditional Random Field Autoencoders for Unsupervised Structured Prediction

We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field.…

Machine Learning · Computer Science 2014-11-11 Waleed Ammar, Chris Dyer, Noah A. Smith

Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential

Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and…

Computation and Language · Computer Science 2025-07-17 Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, Mehrdad Farajtabar

In-Context Imitation Learning via Next-Token Prediction

We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its…

Robotics · Computer Science 2024-10-01 Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, Ken Goldberg

Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length…

Machine Learning · Computer Science 2025-02-19 Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

Is Conditional Generative Modeling all you need for Decision-Making?

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential…

Machine Learning · Computer Science 2023-07-11 Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, Pulkit Agrawal

HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens

Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks.…

Machine Learning · Computer Science 2025-08-05 Ivan Karpukhin, Andrey Savchenko