Related papers: SepTr: Separable Transformer for Audio Spectrogram…

Resource-Efficient Separation Transformer

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-17 Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

SOTR: Segmenting Objects with Transformers

Most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Ruohao Guo, Dantong Niu, Liao Qu, Zhenbo Li

Attention is All You Need in Speech Separation

Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-10 Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-06 Danilo de Oliveira, Tal Peer, Timo Gerkmann

Tiny-Sepformer: A Tiny Time-Domain Transformer Network for Speech Separation

Time-domain Transformer neural networks have proven their superiority in speech separation tasks. However, these models usually have a large number of network parameters, thus often encountering the problem of GPU memory explosion. In this…

Sound · Computer Science 2022-07-01 Jian Luo, Jianzong Wang, Ning Cheng, Edward Xiao, Xulong Zhang, Jing Xiao

AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based attention dual-scale model that utilizes cross-…

Multimedia · Computer Science 2023-06-27 Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

Segatron: Segment-Aware Transformer for Language Modeling and Understanding

Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only with the token position…

Computation and Language · Computer Science 2020-12-17 He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li

RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel…

Machine Learning · Computer Science 2025-02-20 Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

Medical Image Segmentation Using Squeeze-and-Expansion Transformers

Medical image segmentation is important for computer-aided diagnosis. Good segmentation demands the model to see the big picture and fine details simultaneously, i.e., to learn image features that incorporate large context while keep high…

Image and Video Processing · Electrical Eng. & Systems 2021-06-03 Shaohua Li, Xiuchao Sui, Xiangde Luo, Xinxing Xu, Yong Liu, Rick Goh

Spectral Transformer Neural Processes

Time series, spatial data, and images are natural applications of Neural Processes. However, when such data exhibit strong periodicity and quasi-periodicity, existing methods often suffer from underfitting and generalise poorly beyond the…

Machine Learning · Computer Science 2026-05-12 Xianhe Chen, Hao Chen, Yingzhen Li

StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel

Decoding in a Transformer based language model is inherently sequential as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new…

Machine Learning · Computer Science 2025-08-27 Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

Spectrogram features for audio and speech analysis

Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivator for spectrogram-based representations was…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-17 Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song, Donny Soh

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture…

Sound · Computer Science 2020-11-05 Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

SpecTran: Spectral-Aware Transformer-based Adapter for LLM-Enhanced Sequential Recommendation

Traditional sequential recommendation (SR) models learn low-dimensional item ID embeddings from user-item interactions, often overlooking textual information such as item titles or descriptions. Recent advances in Large Language Models…

Information Retrieval · Computer Science 2026-04-27 Yu Cui, Feng Liu, Zhaoxiang Wang, Changwang Zhang, Jun Wang, Can Wang, Jiawei Chen

SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition

The automated reconstruction of the logical arrangement of tables from image data, termed Table Structure Recognition (TSR), is fundamental for semantic data extraction. Recently, researchers have explored a wide range of techniques to…

Computer Vision and Pattern Recognition · Computer Science 2025-06-30 Nam Quan Nguyen, Xuan Phong Pham, Tuan-Anh Tran

Moving Speaker Separation via Parallel Spectral-Spatial Processing

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-27 Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc. In this paper, we propose a transformer-based…

Sound · Computer Science 2024-09-04 Tathagata Bandyopadhyay

SpecTNT: a Time-Frequency Transformer for Music Audio

Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use Transformer as a temporal feature…

Sound · Computer Science 2021-10-26 Wei-Tsung Lu, Ju-Chiang Wang, Minz Won, Keunwoo Choi, Xuchen Song

DasFormer: Deep Alternating Spectrogram Transformer for Multi/Single-Channel Speech Separation

For the task of speech separation, previous study usually treats multi-channel and single-channel scenarios as two research tracks with specialized solutions developed respectively. Instead, we propose a simple and unified architecture -…

Sound · Computer Science 2023-03-15 Shuo Wang, Xiangyu Kong, Xiulian Peng, Mahmood Movassagh, Vinod Prakash, Yan Lu