Related papers: DiffuMamba: High-Throughput Diffusion LMs with Mam…

200 papers

Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic in the number of tokens, leading to significant…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-11 · Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu

Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-20 · Yunxiang Fu, Chaoqi Chen, Yizhou Yu

This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-04 · Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the…

Machine Learning · Computer Science · 2025-06-30 · Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity,…

Machine Learning · Computer Science · 2026-01-08 · Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent…

Computation and Language · Computer Science · 2025-12-08 · Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-28 · Shentong Mo, Yapeng Tian

We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle…

Machine Learning · Computer Science · 2025-02-25 · Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science · 2025-10-10 · Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We…

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To…

Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-10 · Shentong Mo

As one of the most representative DL techniques, Transformer architecture has empowered numerous advanced models, especially the large language models (LLMs) that comprise billions of parameters, becoming a cornerstone in deep learning.…

Machine Learning · Computer Science · 2026-04-07 · Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Hui Liu, Xin Xu, Qing Li

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-27 · Runpeng Yu, Xinyin Ma, Xinchao Wang

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source…

Machine Learning · Computer Science · 2025-08-14 · Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval…

Machine Learning · Computer Science · 2025-10-30 · Nadav Schneider, Itamar Zimerman, Eliya Nachmani

Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the…

Machine Learning · Computer Science · 2026-04-08 · Satyam Goyal, Kushal Patel, Tanush Mittal, Arjun Laxman

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention,…

Machine Learning · Computer Science · 2024-06-03 · Albert Gu, Tri Dao

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based…

Machine Learning · Computer Science · 2025-09-10 · Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao

While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited…

Computation and Language · Computer Science · 2025-01-03 · Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao