Related papers: LAIT: Efficient Multi-Segment Encoding in Transfor…

Latency Adjustable Transformer Encoder for Language Understanding

Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adjusts the inference computational…

Computation and Language · Computer Science · 2024-09-20 · Sajjad Kachuee, Mohammad Sharifkhani

LT2: Linear-Time Looped Transformers

Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally…

Machine Learning · Computer Science · 2026-05-26 · Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-24 · Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

FIT: Far-reaching Interleaved Transformers

We present FIT: a transformer-based architecture with efficient self-attention and adaptive computation. Unlike original transformers, which operate on a single sequence of data tokens, we divide the data tokens into groups, with each group…

Machine Learning · Computer Science · 2023-05-26 · Ting Chen, Lala Li

Latent-attention Based Transformer for Near ML Polar Decoding in Short-code Regime

Transformer architectures have emerged as promising deep learning (DL) tools for modeling complex sequence-to-sequence interactions in channel decoding. However, current transformer-based decoders for error correction codes (ECCs)…

Signal Processing · Electrical Eng. & Systems · 2025-07-22 · Hongzhi Zhu, Wei Xu, Xiaohu You

Efficient Long Sequence Encoding via Synchronization

Pre-trained Transformer models have achieved successes in a wide range of NLP tasks, but are inefficient when dealing with long input sequences. Existing studies try to overcome this challenge via segmenting the long sequence followed by…

Computation and Language · Computer Science · 2022-03-16 · Xiangyang Mou, Mo Yu, Bingsheng Yao, Lifu Huang

Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training

Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are actually often obtained as the sum of two contributions: a residual path that copies the input and the…

Machine Learning · Computer Science · 2026-03-03 · Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene

LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for…

Computation and Language · Computer Science · 2025-11-05 · Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel

ETC: Encoding Long and Structured Inputs in Transformers

Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of…

Machine Learning · Computer Science · 2020-10-28 · Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang

Efficient Adaptive Transformer: An Empirical Study and Reproducible Framework

The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive…

Computation and Language · Computer Science · 2025-10-16 · Jan Miller

Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers

Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other…

Computation and Language · Computer Science · 2023-05-17 · Nurullah Sevim, Ege Ozan Özyedek, Furkan Şahinuç, Aykut Koç

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin

Adaptive Large Language Models By Layerwise Attention Shortcuts

Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we…

Computation and Language · Computer Science · 2024-12-24 · Prateek Verma, Mert Pilanci

Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU

Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property, e.g., in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole,…

Computation and Language · Computer Science · 2024-05-03 · Patrick Kahardipraja, Brielen Madureira, David Schlangen

A Neural ODE Interpretation of Transformer Layers

Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. As the transformer layers use residual connections…

Machine Learning · Computer Science · 2022-12-13 · Yaofeng Desmond Zhong, Tongtao Zhang, Amit Chakraborty, Biswadip Dey

Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals challenging tokens induce abrupt gradient…

Computation and Language · Computer Science · 2025-02-25 · Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers

In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the…

Computation and Language · Computer Science · 2024-11-01 · Amit Ben-Artzy, Roy Schwartz

Lattice-Based Transformer Encoder for Neural Machine Translation

Neural machine translation (NMT) takes deterministic sequences for source representations. However, either word-level or subword-level segmentations have multiple choices to split a source sequence with different word segmentors or…

Computation and Language · Computer Science · 2019-06-05 · Fengshun Xiao, Jiangtong Li, Hai Zhao, Rui Wang, Kehai Chen

Multi-layer Learnable Attention Mask for Multimodal Tasks

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-06 · Wayner Barrios, SouYoung Jin

Byte Latent Transformer: Patches Scale Better Than Tokens

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT…

Computation and Language · Computer Science · 2024-12-16 · Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer