Related papers: Shared DIFF Transformer

DINT Transformer

DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism that enhances the robustness of local attention. However, it has two critical limitations: the lack of global context…

Computation and Language · Computer Science 2025-01-30 Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, Li Shi

Differential Transformer

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism…

Computation and Language · Computer Science 2025-04-08 Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei

Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention…

Machine Learning · Computer Science 2023-02-01 Aosong Feng, Irene Li, Yuang Jiang, Rex Ying

NoiseFormer -- Noise Diffused Symmetric Attention Transformer

The Transformer architecture has been a very successful long runner in the field of Deep Learning (DL) and Large Language Models (LLM) because of its powerful attention-based learning and parallel-natured architecture. As the models grow gigantic…

Machine Learning · Computer Science 2026-01-21 Phani Kumar, Nyshadham, Jyothendra Varma, Polisetty V R K, Aditya Rathore

Grouped Differential Attention

The self-attention mechanism, while foundational to modern Transformer architectures, suffers from a critical inefficiency: it frequently allocates substantial attention to redundant or noisy context. Differential Attention addressed this…

Machine Learning · Computer Science 2025-10-09 Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, Eunhwan Park

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation,…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li

DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the…

Computation and Language · Computer Science 2025-08-01 Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising…

Computer Vision and Pattern Recognition · Computer Science 2024-07-11 Byeongjun Park, Hyojun Go, Jin-Young Kim, Sangmin Woo, Seokil Ham, Changick Kim

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates…

Computation and Language · Computer Science 2025-10-14 Jusheng Zhang, Yijia Fan, Kaitong Cai, Zimeng Huang, Xiaofei Sun, Jian Wang, Chengpei Tang, Keze Wang

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of…

Computation and Language · Computer Science 2019-12-30 Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

FAIR: Focused Attention Is All You Need for Generative Recommendation

Recently, transformer-based generative recommendation has garnered significant attention for user behavior modeling. However, it often requires discretizing items into multi-code representations (e.g., typically four code tokens or more),…

Information Retrieval · Computer Science 2025-12-18 Longtao Xiao, Haolin Zhang, Guohao Cai, Jieming Zhu, Yifan Wang, Heng Chang, Zhenhua Dong, Xiu Li, Ruixuan Li

Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential…

Computation and Language · Computer Science 2025-08-27 Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, Boxing Chen

Selective Attention Improves Transformer

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention…

Computation and Language · Computer Science 2025-04-25 Yaniv Leviathan, Matan Kalman, Yossi Matias

U-shaped Transformer with Frequency-Band Aware Attention for Speech Enhancement

The state-of-the-art speech enhancement has limited performance in speech estimation accuracy. Recently, in deep learning, the Transformer shows the potential to exploit the long-range dependency in speech by self-attention. Therefore, it…

Sound · Computer Science 2023-05-10 Yi Li, Yang Sun, Syed Mohsen Naqvi

DRAformer: Differentially Reconstructed Attention Transformer for Time-Series Forecasting

Time-series forecasting plays an important role in many real-world scenarios, such as equipment life cycle forecasting, weather forecasting, and traffic flow forecasting. It can be observed from recent research that a variety of…

Machine Learning · Computer Science 2022-06-14 Benhan Li, Shengdong Du, Tianrui Li, Jie Hu, Zhen Jia

Understanding Differential Transformer Unchains Pretrained Self-Attentions

Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise canceled attention. However, precisely how differential attention achieves its…

Machine Learning · Computer Science 2025-10-22 Chaerin Kong, Jiho Jang, Nojun Kwak

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples and then employing a model to convert random noise into new…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Huijie Zhang, Yifu Lu, Ismail Alkhouri, Saiprasad Ravishankar, Dogyoon Song, Qing Qu

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen

DDT: Dual-branch Deformable Transformer for Image Denoising

Transformer is beneficial for image denoising tasks since it can model long-range dependencies to overcome the limitations presented by inductive convolutional biases. However, directly applying the transformer structure to remove noise is…

Computer Vision and Pattern Recognition · Computer Science 2023-04-14 Kangliang Liu, Xiangcheng Du, Sijie Liu, Yingbin Zheng, Xingjiao Wu, Cheng Jin