Related papers: Funnel-Transformer: Filtering out Sequential Redun…

Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel…

Computation and Language · Computer Science 2025-04-07 DongHyun Choi, Lucas Spangher, Chris Hidey, Peter Grabowski, Ramy Eskander

Memory-Efficient Fine-Tuning of Transformers via Token Selection

Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing…

Computation and Language · Computer Science 2025-02-03 Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs.…

Computation and Language · Computer Science 2026-04-09 Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han

FL-Tuning: Layer Tuning for Feed-Forward Network in Transformer

Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, existing studies mainly add prompts to the input sequence. This approach would not work as expected due to the intermediate…

Computation and Language · Computer Science 2022-07-01 Jingping Liu, Yuqiu Song, Kui Xue, Hongli Sun, Chao Wang, Lihan Chen, Haiyun Jiang, Jiaqing Liang, Tong Ruan

Efficient Transformers with Dynamic Token Pooling

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments…

Computation and Language · Computer Science 2023-10-25 Piotr Nawrot, Jan Chorowski, Adrian Łańcucki, Edoardo M. Ponti

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Although dominant in natural language processing, transformer-based models remain challenged by the task of long-sequence processing, because the computational cost of self-attention operations in transformers swells quadratically with the…

Computation and Language · Computer Science 2024-07-08 Jiawen Xie, Pengyu Cheng, Xiao Liang, Yong Dai, Nan Du

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

Latency Adjustable Transformer Encoder for Language Understanding

Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adjusts the inference computational…

Computation and Language · Computer Science 2024-09-20 Sajjad Kachuee, Mohammad Sharifkhani

Continual Transformers: Redundancy-Free Attention for Online Inference

Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the…

Artificial Intelligence · Computer Science 2023-06-28 Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted…

Computer Vision and Pattern Recognition · Computer Science 2025-04-11 Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione

HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens

Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks.…

Machine Learning · Computer Science 2025-08-05 Ivan Karpukhin, Andrey Savchenko

Hansel: Output Length Controlling Framework for Large Language Models

Despite the great success of large language models (LLMs), efficiently controlling the length of the output sequence still remains a challenge. In this paper, we propose Hansel, an efficient framework for length control in LLMs without…

Computation and Language · Computer Science 2024-12-19 Seoha Song, Junhyun Lee, Hyeonmok Ko

BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?

Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized…

Computation and Language · Computer Science 2022-12-01 Joel Niklaus, Daniele Giofré

ResFormer: All-Time Reservoir Memory for Long Sequence Classification

Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art…

Computation and Language · Computer Science 2025-09-30 Hongbo Liu, Jia Xu

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations…

Machine Learning · Computer Science 2026-01-14 Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers

Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other…

Computation and Language · Computer Science 2023-05-17 Nurullah Sevim, Ege Ozan Özyedek, Furkan Şahinuç, Aykut Koç

PartialFormer: Modeling Part Instead of Whole for Machine Translation

The design choices in Transformer feed-forward neural networks have resulted in significant computational and parameter overhead. In this work, we emphasize the importance of hidden dimensions in designing lightweight FFNs, a factor often…

Computation and Language · Computer Science 2024-06-06 Tong Zheng, Bei Li, Huiwen Bao, Jiale Wang, Weiqiao Shan, Tong Xiao, Jingbo Zhu

TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling

In this paper, we describe the use of recurrent neural networks to capture sequential information from the self-attention representations to improve the Transformers. Although self-attention mechanism provides a means to exploit long…

Computation and Language · Computer Science 2021-04-06 Tze Yuang Chong, Xuyang Wang, Lin Yang, Junjie Wang

Structured Convergence in Large Language Model Representations via Hierarchical Latent Space Folding

Token representations in high-dimensional latent spaces often exhibit redundancy, limiting computational efficiency and reducing structural coherence across model layers. Hierarchical latent space folding introduces a structured…

Computation and Language · Computer Science 2025-08-11 Fenella Harcourt, Naderdel Piero, Gilbert Sutherland, Daphne Holloway, Harriet Bracknell, Julian Ormsby

Pushdown Layers: Encoding Recursive Structure in Transformer Language Models

Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail…

Computation and Language · Computer Science 2023-10-31 Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning