Related papers: Efficient Long Sequence Encoding via Synchronizati…

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Although dominant in natural language processing, transformer-based models remain challenged by the task of long-sequence processing, because the computational cost of self-attention operations in transformers swells quadratically with the…

Computation and Language · Computer Science 2024-07-08 Jiawen Xie, Pengyu Cheng, Xiao Liang, Yong Dai, Nan Du

Hierarchical Transformers Are More Efficient Language Models

Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured…

Machine Learning · Computer Science 2022-04-19 Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski

LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model…

Computation and Language · Computer Science 2025-11-10 Wei Shao, Lingchao Zheng, Pengyu Wang, Peizhen Zheng, Jun Li, Yuwei Fan

HiPool: Modeling Long Documents Using Graph Neural Networks

Encoding long sequences in Natural Language Processing (NLP) is a challenging problem. Though recent pretraining language models achieve satisfying performances in many NLP tasks, they are still restricted by a pre-defined maximum length,…

Computation and Language · Computer Science 2023-05-16 Irene Li, Aosong Feng, Dragomir Radev, Rex Ying

Randomized Positional Encodings Boost Length Generalization of Transformers

Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. Moreover, simply…

Machine Learning · Computer Science 2023-05-29 Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be…

Computation and Language · Computer Science 2023-06-01 Jeremiah Milbauer, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster

A Survey on Transformer Context Extension: Approaches and Evaluation

Large language models (LLMs) based on Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long…

Computation and Language · Computer Science 2025-07-09 Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu

Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings

Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by…

Computation and Language · Computer Science 2025-11-20 Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao

ETC: Encoding Long and Structured Inputs in Transformers

Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of…

Machine Learning · Computer Science 2020-10-28 Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

Hyperdecoders: Instance-specific decoders for multi-task NLP

We investigate input-conditioned hypernetworks for multi-tasking in NLP, generating parameter-efficient adaptations for a decoder using a hypernetwork conditioned on the output of an encoder. This approach produces a unique decoder…

Computation and Language · Computer Science 2022-10-19 Hamish Ivison, Matthew E. Peters

Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying…

Computation and Language · Computer Science 2022-01-04 Dušan Variš, Ondřej Bojar

From Anchors to Answers: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models

Enabling large language models (LLMs) to effectively process and reason with graph-structured data remains a significant challenge despite their remarkable success in natural language tasks. Current approaches either convert graph…

Artificial Intelligence · Computer Science 2025-09-03 Yanbiao Ji, Chang Liu, Xin Chen, Dan Luo, Mei Li, Yue Ding, Wenqing Lin, Hongtao Lu

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification

Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents. There are clear benefits to these approaches compared to the original Transformer in terms…

Computation and Language · Computer Science 2022-10-12 Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, Desmond Elliott

Hierarchical Learning for Generation with Long Source Sequences

One of the challenges for current sequence to sequence (seq2seq) models is processing long sequences, such as those in summarization and document level machine translation tasks. These tasks require the model to reason at the token level as…

Computation and Language · Computer Science 2021-09-20 Tobias Rohde, Xiaoxia Wu, Yinhan Liu

Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

Contextually Structured Token Dependency Encoding for Large Language Models

Token representation strategies within large-scale neural architectures often rely on contextually refined embeddings, yet conventional approaches seldom encode structured relationships explicitly within token interactions. Self-attention…

Computation and Language · Computer Science 2025-03-27 James Blades, Frederick Somerfield, William Langley, Susan Everingham, Maurice Witherington

Encoding-based Memory Modules for Recurrent Neural Networks

Learning to solve sequential tasks with recurrent models requires the ability to memorize long sequences and to extract task-relevant features from them. In this paper, we study the memorization subtask from the point of view of the design…

Machine Learning · Computer Science 2020-02-03 Antonio Carta, Alessandro Sperduti, Davide Bacciu

Cross-Thought for Sentence Encoder Pre-training

In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original…

Computation and Language · Computer Science 2020-10-09 Shuohang Wang, Yuwei Fang, Siqi Sun, Zhe Gan, Yu Cheng, Jing Jiang, Jingjing Liu

Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based…

Computation and Language · Computer Science 2025-05-26 Aosong Feng, Rex Ying, Leandros Tassiulas