Related papers: Factorized Multimodal Transformer for Multimodal S…

Learning Factorized Multimodal Representations

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information,…

Machine Learning · Computer Science 2019-05-15 Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, Ruslan Salakhutdinov

Multimodal Transformer for Unaligned Multimodal Language Sequences

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data…

Computation and Language · Computer Science 2019-06-04 Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

On Difficulties of Attention Factorization through Shared Memory

Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex…

Machine Learning · Computer Science 2024-04-02 Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel

Multimodal Vision Transformers with Forced Attention for Behavior Analysis

Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. It is necessary as it allows the design of more human-like machines. While transformer approaches have…

Computer Vision and Pattern Recognition · Computer Science 2022-12-09 Tanay Agrawal, Michal Balazia, Philipp Müller, François Brémond

ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt

Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed instruction datasets. However, novel tasks would be encountered sequentially in a dynamic world, which calls for equipping LMMs with multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, Cheng-Lin Liu

An Efficient Multimodal Learning Framework to Comprehend Consumer Preferences Using BERT and Cross-Attention

Today, the acquisition of various behavioral log data has enabled deeper understanding of customer preferences and future behaviors in the marketing field. In particular, multimodal deep learning has achieved highly accurate predictions by…

Computational Engineering, Finance, and Science · Computer Science 2024-05-14 Junichiro Niimi

Sampling Foundational Transformer: A Theoretical Perspective

The versatility of self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. To apply transformers across different data modalities,…

Machine Learning · Computer Science 2024-08-20 Viet Anh Nguyen, Minh Lenhat, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy

FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation

Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from…

Multimedia · Computer Science 2025-11-26 Yadong Liu, Shangfei Wang

Multimodal Infusion Tuning for Large Models

Recent advancements in large-scale models have showcased remarkable generalization capabilities in various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high…

Multimedia · Computer Science 2024-07-17 Hao Sun, Yu Song, Xinyao Yu, Jiaqing Liu, Yen-Wei Chen, Lanfen Lin

Federated Continual Instruction Tuning

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for…

Machine Learning · Computer Science 2025-07-22 Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu

Sparse Fusion for Multimodal Transformers

Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, thus unimodal information can be drastically sparsified prior to multimodal fusion without…

Computer Vision and Pattern Recognition · Computer Science 2021-11-29 Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, Tobias Höllerer

Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior…

Machine Learning · Statistics 2020-09-29 Bryan Lim, Sercan O. Arik, Nicolas Loeff, Tomas Pfister

FAST: Factorizable Attention for Speeding up Transformers

Motivated by the factorization inherent in the original fast multipole method and the improved fast Gauss transform we introduce a factorable form of attention that operates efficiently in high dimensions. This approach reduces the…

Machine Learning · Computer Science 2024-02-13 Armin Gerami, Monte Hoover, Pranav S. Dulepet, Ramani Duraiswami

MMSFormer: Multimodal Transformer for Material and Semantic Segmentation

Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif

Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion

We revisit the use of spectral techniques to replace the attention mechanism in Transformers through Fourier Transform based token mixing, and present a comprehensive and novel reformulation of this technique in next generation transformer…

Computation and Language · Computer Science 2025-04-24 Andrew Kiruluta, Andreas Lemos, Eric Lundy

Meta-Transformer: A Unified Framework for Multimodal Learning

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various…

Computer Vision and Pattern Recognition · Computer Science 2023-07-21 Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences

Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse…

Computer Vision and Pattern Recognition · Computer Science 2021-12-06 Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, Aimin Zhou

MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

Multi-modal fusion is proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine the heterogeneous information and exploit the complementarity of…

Computer Vision and Pattern Recognition · Computer Science 2021-12-15 Yidi Li, Hong Liu, Hao Tang

UniT: Multimodal Multitask Learning with a Unified Transformer

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Ronghang Hu, Amanpreet Singh