Related papers: Block-State Transformers

MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling

State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. They rely on linear recurrences to integrate information over time, enabling fast…

Machine Learning · Computer Science 2026-01-01 Mahdi Karami, Ali Behrouz, Peilin Zhong, Razvan Pascanu, Vahab Mirrokni

Characterizing State Space Model and Hybrid Language Model Performance with Long Context

Emerging applications such as AR are driving demands for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, currently dominant models based on the Transformer architecture suffer from…

Hardware Architecture · Computer Science 2026-03-24 Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon

Efficient Long Sequence Modeling via State Space Augmented Transformer

Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing…

Computation and Language · Computer Science 2022-12-19 Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, Jianfeng Gao

Multi-Head State Space Model for Speech Recognition

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM)…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-29 Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

State Space Model for New-Generation Network Alternative to Transformers: A Survey

In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred…

Machine Learning · Computer Science 2024-04-16 Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, Ziwen Wang, Bo Jiang, Chenglong Li, Yaowei Wang, Yonghong Tian, Jin Tang

Transition-based Parsing with Stack-Transformers

Modeling the parser state is key to good performance in transition-based parsing. Recurrent Neural Networks considerably improved the performance of transition-based systems by modelling the global state, e.g. stack-LSTM parsers, or local…

Computation and Language · Computer Science 2020-10-22 Ramon Fernandez Astudillo, Miguel Ballesteros, Tahira Naseem, Austin Blodgett, Radu Florian

A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing with linear scaling, yet how contextual information flows across layers in these architectures remains…

Computation and Language · Computer Science 2026-01-08 Nhat M. Hoang, Do Xuan Long, Cong-Duy Nguyen, Min-Yen Kan, Luu Anh Tuan

Structured State Space Models for In-Context Reinforcement Learning

Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful…

Machine Learning · Computer Science 2023-11-27 Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, Feryal Behbahani

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits…

Machine Learning · Computer Science 2026-05-14 Hoang-Quan Nguyen, Sankalp Pandey, Khoa Luu

State-Space Large Audio Language Models

Large Audio Language Models (LALM) combine the audio perception models and the Large Language Models (LLM) and show a remarkable ability to reason about the input audio, infer the meaning, and understand the intent. However, these systems…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-26 Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Blockwise Parallel Transformer for Large Context Models

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention…

Computation and Language · Computer Science 2023-08-30 Hao Liu, Pieter Abbeel

SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting

Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its…

Machine Learning · Computer Science 2025-11-04 Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, Kai Shu

Flash STU: Fast Spectral Transform Units

Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a…

Machine Learning · Computer Science 2026-01-21 Y. Isabel Liu, Windsor Nguyen, Yagiz Devre, Evan Dogariu, Anirudha Majumdar, Elad Hazan

State Space Models are Provably Comparable to Transformers in Dynamic Token Selection

Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is much smaller than that of Transformers. While the capabilities of SSMs have been…

Machine Learning · Statistics 2025-03-06 Naoki Nishikawa, Taiji Suzuki

Latent Speech-Text Transformer

Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to…

Computation and Language · Computer Science 2026-03-11 Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le

On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages

Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks,…

Machine Learning · Computer Science 2025-07-08 Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Convolutional State Space Models for Long-Range Spatiotemporal Modeling

Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states…

Machine Learning · Computer Science 2023-10-31 Jimmy T. H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon

Repeat After Me: Transformers are Better than State Space Models at Copying

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models"…

Machine Learning · Computer Science 2024-06-05 Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

Block-Recurrent Transformers

We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than…

Machine Learning · Computer Science 2022-11-03 DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur