Related papers: Multimodal Grounding for Language Processing

Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a…

Computation and Language · Computer Science 2025-03-25 Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic…

Computation and Language · Computer Science 2025-11-26 Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Ali Vosoughi, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu

Multimodal Large Language Models: A Survey

The exploration of multimodal language models integrates multiple data types, such as images, text, language, audio, and other heterogeneity. While the latest large language models excel in text-based tasks, they often struggle to…

Artificial Intelligence · Computer Science 2023-11-23 Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, Philip S. Yu

Recent Advances and Trends in Multimodal Deep Learning: A Review

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information using various modalities. Despite…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul

Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis

The human language can be expressed through multiple sources of information known as modalities, including tones of voice, facial gestures, and spoken language. Recent multimodal learning with strong performances on human-centric tasks such…

Computation and Language · Computer Science 2020-10-06 Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdinov, Louis-Philippe Morency

Multimodal Embeddings from Language Models

Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvement in state of the art across many…

Computation and Language · Computer Science 2019-09-11 Shao-Yen Tseng, Panayiotis Georgiou, Shrikanth Narayanan

The Revolution of Multimodal Large Language Models: A Survey

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

A Survey Of Cross-lingual Word Embedding Models

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In…

Computation and Language · Computer Science 2019-10-08 Sebastian Ruder, Ivan Vulić, Anders Søgaard

Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this…

Computation and Language · Computer Science 2023-11-09 Danial Kamali, Parisa Kordjamshidi

Investigating Inner Properties of Multimodal Representation and Semantic Compositionality with Brain-based Componential Semantics

Multimodal models have been proven to outperform text-based approaches on learning semantic representations. However, it still remains unclear what properties are encoded in multimodal representations, in what aspects do they outperform the…

Computation and Language · Computer Science 2017-11-23 Shaonan Wang, Jiajun Zhang, Nan Lin, Chengqing Zong

Learning Multimodal Word Representation via Dynamic Fusion Methods

Multimodal models have been proven to outperform text-based models on learning semantic word representations. Almost all previous multimodal models typically treat the representations from different modalities equally. However, it is…

Computation and Language · Computer Science 2018-01-03 Shaonan Wang, Jiajun Zhang, Chengqing Zong

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science 2017-11-10 Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Multimodal Machine Translation through Visuals and Speech

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area…

Computation and Language · Computer Science 2019-12-02 Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative…

Machine Learning · Computer Science 2023-02-21 Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency

A Review on Methods and Applications in Multimodal Deep Learning

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities.…

Machine Learning · Computer Science 2022-02-21 Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul

What is Multimodality?

The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit…

Artificial Intelligence · Computer Science 2021-08-23 Letitia Parcalabescu, Nils Trost, Anette Frank

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal…

Artificial Intelligence · Computer Science 2024-07-19 Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges

In recent years, multi-modal machine translation has attracted significant interest in both academia and industry due to its superior performance. It takes both textual and visual modalities as inputs, leveraging visual context to tackle…

Computation and Language · Computer Science 2024-05-24 Huangjun Shen, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, Jinsong Su

Multimodal Conversational AI: A Survey of Datasets and Approaches

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are…

Machine Learning · Computer Science 2022-05-17 Anirudh Sundar, Larry Heck

Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training

Language grounding aims at linking the symbolic representation of language (e.g., words) into the rich perceptual knowledge of the outside world. The general approach is to embed both textual and visual information into a common space -the…

Computation and Language · Computer Science 2021-09-15 Hassan Shahmohammadi, Hendrik P. A. Lensch, R. Harald Baayen