Related papers: Deep Mamba Multi-modal Learning
Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, almost all existing CMH methods are based on hand-crafted features which…
Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive…
Depression is a common mental disorder that affects millions of people worldwide. Although promising, current multimodal methods hinge on aligned or aggregated multimodal fusion, suffering two significant limitations: (i) inefficient…
Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we…
Mamba-based architectures have shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are…
In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To…
Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable…
Multimodal Large Language Models (MLLMs) have attracted much attention for their multifunctionality. However, traditional Transformer architectures incur significant overhead due to their secondary computational complexity. To address this…
Due to the rapid advancements of sensory and computing technology, multi-modal data sources that represent the same pattern or phenomenon have attracted growing attention. As a result, finding means to explore useful information from these…
Learning the hash representation of multi-view heterogeneous data is an important task in multimedia retrieval. However, existing methods fail to effectively fuse the multi-view features and utilize the metric information provided by the…
Multiple modalities can provide more valuable information than single one by describing the same contents in various ways. Hence, it is highly expected to learn effective joint representation by fusing the features of different modalities.…
Recent works have shown the remarkable superiority of transformer models in reinforcement learning (RL), where the decision-making problem is formulated as sequential generation. Transformer-based agents could emerge with self-improvement…
Multimodal image fusion aims to integrate information from different imaging techniques to produce a comprehensive, detail-rich single image for downstream vision tasks. Existing methods based on local convolutional neural networks (CNNs)…
Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different…
Multimodal fusion has made great progress in the field of remote sensing image classification due to its ability to exploit the complementary spatial-spectral information. Deep learning methods such as CNN and Transformer have been widely…
We introduce a novel deep learning method for decoding error correction codes based on the Mamba architecture, enhanced with Transformer layers. Our approach proposes a hybrid decoder that leverages Mamba's efficient sequential modeling…
Videos have become ubiquitous on the Internet. And video analysis can provide lots of information for detecting and recognizing objects as well as help people understand human actions and interactions with the real world. However, facing…
Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias…
Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain…
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as…