Related papers: Bidirectional Long-Range Parser for Sequential Dat…
Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well on long sequence tasks due to their quadratic complexity w.r.t. the inputs length.…
The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limit its application on long text. In this paper, adopting a fine-to-coarse attention mechanism on…
Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box…
Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic…
This paper introduces a novel approach to enhance the capabilities of Large Language Models (LLMs) in processing and understanding extensive text sequences, a critical aspect in applications requiring deep comprehension and synthesis of…
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have…
As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token…
Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…
Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or…
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its…
This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a…
Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks including sentence classification, machine translation, and question answering. The BERT model…
Since their release, Transformers have revolutionized many fields from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind with first Transformer based models for DU dating from late 2019.…
In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are…
The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we…
Many studies have been conducted to improve the efficiency of Transformer from quadric to linear. Among them, the low-rank-based methods aim to learn the projection matrices to compress the sequence length. However, the projection matrices…
With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation…
Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its…
Transformer encoders are widely deployed in large-scale web services for natural language understanding tasks such as text classification, semantic retrieval, and content ranking. However, their high inference latency and memory consumption…
As Transformers have become state-of-the-art models for natural language processing (NLP) tasks, the need to understand and explain their predictions is increasingly apparent. Especially in unsupervised applications, such as information…