Related papers: Chunk, Align, Select: A Simple Long-sequence Proce…

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference

Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their…

Performance · Computer Science 2024-07-10 Xuanlei Zhao, Shenggan Cheng, Guangyang Lu, Jiarui Fang, Haotian Zhou, Bin Jia, Ziming Liu, Yang You

Efficient Long Sequence Encoding via Synchronization

Pre-trained Transformer models have achieved successes in a wide range of NLP tasks, but are inefficient when dealing with long input sequences. Existing studies try to overcome this challenge via segmenting the long sequence followed by…

Computation and Language · Computer Science 2022-03-16 Xiangyang Mou, Mo Yu, Bingsheng Yao, Lifu Huang

Efficient Long-Text Understanding with Short-Text Models

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity.…

Computation and Language · Computer Science 2022-12-29 Maor Ivgi, Uri Shaham, Jonathan Berant

Efficient Long Context Fine-tuning with Chunk Flow

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-14 Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin

ChunkFormer: Learning Long Time Series with Multi-stage Chunked Transformer

The analysis of long sequence data remains challenging in many real-world applications. We propose a novel architecture, ChunkFormer, that improves the existing Transformer framework to handle the challenges while dealing with long time…

Machine Learning · Computer Science 2022-01-03 Yue Ju, Alka Isac, Yimin Nie

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded…

Computation and Language · Computer Science 2024-10-10 Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung

Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks…

Computation and Language · Computer Science 2025-06-04 Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the…

Machine Learning · Computer Science 2020-06-08 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

ChuLo: Chunk-Level Key Information Representation for Long Document Understanding

Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating…

Computation and Language · Computer Science 2025-08-21 Yan Li, Soyeon Caren Han, Yue Dai, Feiqi Cao

Recurrent Chunking Mechanisms for Long-Text Machine Reading Comprehension

In this paper, we study machine reading comprehension (MRC) on long texts, where a model takes as inputs a lengthy document and a question and then extracts a text span from the document as an answer. State-of-the-art models tend to use a…

Computation and Language · Computer Science 2020-05-20 Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, Dong Yu

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently,…

Computation and Language · Computer Science 2025-07-08 Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention's quadratic complexity with input tokens. Recently, researchers have proposed a…

Computation and Language · Computer Science 2026-05-26 Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

Transformers are central in modern natural language processing and computer vision applications. Despite recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length), dealing with ultra long…

Computation and Language · Computer Science 2023-05-30 Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning…

Computation and Language · Computer Science 2025-10-13 Eugene Kwek, Wenpeng Yin

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and…

Computation and Language · Computer Science 2026-05-01 Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens…

Computation and Language · Computer Science 2024-06-03 Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

Deconvolutional Paragraph Representation Learning

Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality…

Computation and Language · Computer Science 2017-09-25 Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, Lawrence Carin

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Synchronous Transformers for End-to-End Speech Recognition

For most of the attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. The asynchronous problem between the encoding and decoding makes these…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen