Related papers: Bidirectional Long-Range Parser for Sequential Dat…

FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks

Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well on long sequence tasks due to their quadratic complexity w.r.t. the input length.…

Machine Learning · Computer Science 2022-02-24 Maksim Zubkov, Daniil Gavrilov

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application on long text. In this paper, adopting a fine-to-coarse attention mechanism on…

Computation and Language · Computer Science 2019-11-12 Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box…

Computation and Language · Computer Science 2024-06-11 Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention

This paper introduces a novel approach to enhance the capabilities of Large Language Models (LLMs) in processing and understanding extensive text sequences, a critical aspect in applications requiring deep comprehension and synthesis of…

Computation and Language · Computer Science 2023-12-15 Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu

The NLP Task Effectiveness of Long-Range Transformers

Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have…

Computation and Language · Computer Science 2024-12-10 Guanghui Qin, Yukun Feng, Benjamin Van Durme

BiFormer: Vision Transformer with Bi-Level Routing Attention

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, Rynson Lau

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-27 Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

Hierarchical Transformers for Long Document Classification

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its…

Computation and Language · Computer Science 2019-10-25 Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention

This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a…

Machine Learning · Computer Science 2025-04-22 Tao Yang, Yu Cheng, Yaokun Ren, Yujia Lou, Minggu Wei, Honghui Xin

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks including sentence classification, machine translation, and question answering. The BERT model…

Computation and Language · Computer Science 2020-03-17 Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, Bing Xiang

Long-Range Transformer Architectures for Document Understanding

Since their release, Transformers have revolutionized many fields from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.…

Computation and Language · Computer Science 2023-09-12 Thibault Douzon, Stefan Duffner, Christophe Garcia, Jérémy Espinas

Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes

In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are…

Machine Learning · Computer Science 2025-05-19 Ashok Arora, Neetesh Kumar

Adaptive Transformers for Learning Multimodal Representations

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we…

Computation and Language · Computer Science 2020-07-09 Prajjwal Bhargava

DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention

Many studies have been conducted to improve the efficiency of Transformer from quadratic to linear. Among them, the low-rank-based methods aim to learn the projection matrices to compress the sequence length. However, the projection matrices…

Machine Learning · Computer Science 2022-11-30 Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang

Efficient Machine Translation with a BiLSTM-Attention Approach

With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation…

Computation and Language · Computer Science 2024-11-01 Yuxu Wu, Yiren Xing

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its…

Computation and Language · Computer Science 2021-06-08 Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu

SHRP: Specialized Head Routing and Pruning for Efficient Encoder Compression

Transformer encoders are widely deployed in large-scale web services for natural language understanding tasks such as text classification, semantic retrieval, and content ranking. However, their high inference latency and memory consumption…

Machine Learning · Computer Science 2025-12-25 Zeli Su, Ziyin Zhang, Wenzheng Zhang, Zhou Liu, Guixian Xu, Wentao Zhang

Explaining Text Similarity in Transformer Models

As Transformers have become state-of-the-art models for natural language processing (NLP) tasks, the need to understand and explain their predictions is increasingly apparent. Especially in unsupervised applications, such as information…

Computation and Language · Computer Science 2024-05-13 Alexandros Vasileiou, Oliver Eberle