Related papers: Multimodal Neural Databases

Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables

In this paper, we propose Multi-Modal Databases (MMDBs), a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend…

Databases · Computer Science 2023-05-01 Matthias Urban, Carsten Binnig

A Review on Methods and Applications in Multimodal Deep Learning

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities.…

Machine Learning · Computer Science 2022-02-21 Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul

Multimodal Large Language Models: A Survey

The exploration of multimodal language models integrates multiple data types, such as images, text, language, audio, and other heterogeneity. While the latest large language models excel in text-based tasks, they often struggle to…

Artificial Intelligence · Computer Science 2023-11-23 Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, Philip S. Yu

Deep Multimodal Neural Architecture Search

Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to…

Computer Vision and Pattern Recognition · Computer Science 2020-10-13 Zhou Yu, Yuhao Cui, Jun Yu, Meng Wang, Dacheng Tao, Qi Tian

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Multimodal deep learning systems which employ multiple modalities like text, image, audio, video, etc., are showing better performance in comparison with individual modalities (i.e., unimodal) systems. Multimodal machine learning involves…

Machine Learning · Computer Science 2022-01-19 Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha

RMDL: Random Multimodel Deep Learning for Classification

The continually increasing number of complex datasets each year necessitates ever improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new…

Machine Learning · Computer Science 2018-06-01 Kamran Kowsari, Mojtaba Heidarysafa, Donald E. Brown, Kiana Jafari Meimandi, Laura E. Barnes

Efficient Table Retrieval and Understanding with Multimodal Large Language Models

Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding,…

Artificial Intelligence · Computer Science 2026-02-10 Zhuoyan Xu, Haoyang Fang, Boran Han, Bonan Min, Bernie Wang, Cuixiong Hu, Shuai Zhang

Recent Advances and Trends in Multimodal Deep Learning: A Review

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information using various modalities. Despite…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul

Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy

Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the…

Artificial Intelligence · Computer Science 2024-12-24 Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Bhargava Kumar, Amit Agarwal, Ishan Banerjee, Srikant Panda, Tejaswini Kumar

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable…

Multimedia · Computer Science 2024-02-19 Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and…

Computer Vision and Pattern Recognition · Computer Science 2025-10-15 Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the…

Information Retrieval · Computer Science 2026-04-08 Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, Vijai Mohan

MuMUR: Multilingual Multimodal Universal Retrieval

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex…

Computer Vision and Pattern Recognition · Computer Science 2024-07-04 Lei Chen, Feng Yan, Yujie Zhong, Shaoxiang Chen, Zequn Jie, Lin Ma

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data…

Artificial Intelligence · Computer Science 2024-08-05 Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, Yutong Zhang, Zihao Wu, Zhengliang Liu, Tianyang Zhong, Bao Ge, Tuo Zhang, Ning Qiang, Xintao Hu, Xi Jiang, Xin Zhang, Wei Zhang, Dinggang Shen, Tianming Liu, Shu Zhang

M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual…

Information Retrieval · Computer Science 2025-12-04 Adithya S Kolavi, Vyoman Jain

Dynamic Memory Networks for Visual and Textual Question Answering

Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of…

Neural and Evolutionary Computing · Computer Science 2016-03-07 Caiming Xiong, Stephen Merity, Richard Socher

Multimodal Table Understanding

Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text…

Computation and Language · Computer Science 2024-06-13 Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, Weiping Wang

Knowledge-Aware Reasoning over Multimodal Semi-structured Tables

Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual…

Computation and Language · Computer Science 2024-08-27 Suyash Vardhan Mathur, Jainit Sushil Bafna, Kunal Kartik, Harshita Khandelwal, Manish Shrivastava, Vivek Gupta, Mohit Bansal, Dan Roth