Related papers: Mini-Sequence Transformer: Optimizing Intermediate…

Long Short-Term Memory Spatial Transformer Network

Spatial transformer network has been used in a layered form in conjunction with a convolutional network to enable the model to transform data spatially. In this paper, we propose a combined spatial transformer network (STN) and a Long…

Image and Video Processing · Electrical Eng. & Systems 2019-09-02 Shiyang Feng, Tianyue Chen, Hao Sun

Latent Speech-Text Transformer

Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to…

Computation and Language · Computer Science 2026-03-11 Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le

HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing

Transformer-based large language models (LLM) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in…

Computation and Language · Computer Science 2025-02-07 Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, Jason Cong

Efficient Machine Translation with a BiLSTM-Attention Approach

With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation…

Computation and Language · Computer Science 2024-11-01 Yuxu Wu, Yiren Xing

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Ultra-Long Sequence Distributed Transformer

Transformer models trained on long sequences often achieve higher accuracy than short sequences. Unfortunately, conventional transformers struggle with long sequence training due to the overwhelming computation and memory requirements.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-09 Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gouley

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billion parameters, transformer-based LLMs necessitate months of…

Machine Learning · Computer Science 2024-08-22 Pihe Hu, Shaolong Li, Longbo Huang

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Sequence transducers, such as the RNN-T and the Conformer-T, are one of the most promising models of end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods,…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-07 Yusuke Shinohara, Shinji Watanabe

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Intrinsically Sparse Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) has achieved state-of-the-art performances on a wide range of tasks. Its outstanding performance is guaranteed by the long-term memory ability which matches the sequential data perfectly and the gating…

Neural and Evolutionary Computing · Computer Science 2019-01-29 Shiwei Liu, Decebal Constantin Mocanu, Mykola Pechenizkiy

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by…

Computation and Language · Computer Science 2026-05-20 Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

Long Short-Term Memory-Networks for Machine Reading

In this paper we address the question of how to render sequence-level networks better at handling structured input. We propose a machine reading simulator which processes text incrementally from left to right and performs shallow reasoning…

Computation and Language · Computer Science 2016-09-22 Jianpeng Cheng, Li Dong, Mirella Lapata

Sequence to Sequence Learning with Neural Networks

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to…

Computation and Language · Computer Science 2014-12-16 Ilya Sutskever, Oriol Vinyals, Quoc V. Le

Performance of Three Slim Variants of The Long Short-Term Memory (LSTM) Layer

The Long Short-Term Memory (LSTM) layer is an important advancement in the field of neural networks and machine learning, allowing for effective training and impressive inference performance. LSTM-based neural networks have been…

Neural and Evolutionary Computing · Computer Science 2019-01-04 Daniel Kent, Fathi M. Salem

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process…

Computation and Language · Computer Science 2024-05-29 Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source…

Machine Learning · Computer Science 2026-05-27 Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

Compact Recurrent Transformer with Persistent Memory

The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with…

Machine Learning · Computer Science 2025-05-05 Edison Mucllari, Zachary Daniels, David Zhang, Qiang Ye

Long Short-Term Attention

Attention is an important cognition process of humans, which helps humans concentrate on critical information during their perception and learning. However, although many machine learning models can remember information of data, they have…

Machine Learning · Computer Science 2019-09-06 Guoqiang Zhong, Xin Lin, Kang Chen, Qingyang Li, Kaizhu Huang

Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies

Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional encoder-decoder policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT…

Computation and Language · Computer Science 2025-09-29 Qianen Zhang, Satoshi Nakamura

MST: Masked Self-Supervised Transformer for Visual Representation

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang