Related papers: Dimba: Transformer-Mamba Diffusion Models

200 papers

Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic in the number of tokens, leading to significant…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-11 · Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu

In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-28 · Shentong Mo, Yapeng Tian

The goal of style transfer is, given a content image and a style source, to generate a new image that preserves the content but adopts the artistic representation of the style source. Most of the state-of-the-art architectures use transformers or…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-15 · Filippo Botti, Alex Ergasti, Leonardo Rossi, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model…

We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-14 · Hao Phung, Quan Dao, Trung Dao, Hoang Phan, Dimitris Metaxas, Anh Tran

Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-10 · Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, Xiaodan Liang

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-26 · Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Schusterbauer, Björn Ommer

Autonomous driving systems demand trajectory planners that not only model the inherent uncertainty of future motions but also respect complex temporal dependencies and underlying physical laws. While diffusion-based generative models excel…

Robotics · Computer Science · 2026-02-03 · Hang Zhou, Qiang Zhang, Peiran Liu, Yihao Qin, Zhaoxu Yan, Yiding Ji

In recent years, Transformers have become the de-facto architecture for sequence modeling on text and a variety of multi-dimensional data, such as images and video. However, the use of self-attention layers in a Transformer incurs…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-16 · Shufan Li, Harkanwar Singh, Aditya Grover

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention or KV-cache overhead. We…

Machine Learning · Computer Science · 2026-03-02 · Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Eugene Belilovsky, Torsten Scholak

Transformers have become increasingly popular for image super-resolution (SR) tasks due to their strong global context modeling capabilities. However, their quadratic computational complexity necessitates the use of window-based attention…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-11 · Aman Urumbekov, Zheng Chen

Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-20 · Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo

Image generation models have encountered challenges related to scalability and quadratic complexity, primarily due to the reliance on Transformer-based backbones. In this study, we introduce MaskMamba, a novel hybrid model that combines…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Wenchao Chen, Liqiang Niu, Ziyao Lu, Fandong Meng, Jie Zhou

As one of the most representative DL techniques, Transformer architecture has empowered numerous advanced models, especially the large language models (LLMs) that comprise billions of parameters, becoming a cornerstone in deep learning.…

Machine Learning · Computer Science · 2026-04-07 · Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Hui Liu, Xin Xu, Qing Li

Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-06 · Chenguang Zhu, Shan Gao, Huafeng Chen, Guangqian Guo, Chaowei Wang, Yaoxing Wang, Chen Shu Lei, Quanjiang Fan

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity,…

Machine Learning · Computer Science · 2026-01-08 · Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang

Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-10 · Shentong Mo

U-shaped architectures have long dominated the field of medical image segmentation, while Transformers are widely employed for modeling long-range dependencies. The former typically handles scale variations implicitly by aggregating…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-12 · Yanhua Zhang, Ke Zhang, Jingyu Wang, Gabriella Balestra, Samanta Rosati, Yulin Wu, Wuwei Wang, Valentina Giannini

Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-25 · Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu

We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle…

Machine Learning · Computer Science · 2025-02-25 · Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu