Related papers: Sequential Compositional Generalization in Multimo…

200 papers

Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this…

Computation and Language · Computer Science · 2023-11-09 · Danial Kamali, Parisa Kordjamshidi

Leveraging the compositional nature of our world to expedite learning and facilitate generalization is a hallmark of human perception. In machine learning, on the other hand, achieving compositional generalization has proven to be an…

Machine Learning · Computer Science · 2023-07-13 · Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, Wieland Brendel

Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will…

Machine Learning · Computer Science · 2025-07-10 · Arnas Uselis, Andrea Dittadi, Seong Joon Oh

The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb

Compositional generalization, representing the model's ability to generate text with new attribute combinations obtained by recombining single attributes from the training data, is a crucial property for multi-aspect controllable text…

Computation and Language · Computer Science · 2024-06-04 · Tianqi Zhong, Zhaoyi Li, Quan Wang, Linqi Song, Ying Wei, Defu Lian, Zhendong Mao

Multimodal models have been proven to outperform text-based approaches on learning semantic representations. However, it still remains unclear what properties are encoded in multimodal representations, in what aspects do they outperform the…

Computation and Language · Computer Science · 2017-11-23 · Shaonan Wang, Jiajun Zhang, Nan Lin, Chengqing Zong

Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of…

Computation and Language · Computer Science · 2023-12-06 · Xinnuo Xu, Ivan Titov, Mirella Lapata

Multi-modal music generation, using multiple modalities like text, images, and video alongside musical scores and audio as guidance, is an emerging research area with broad applications. This paper reviews this field, categorizing music…

Sound · Computer Science · 2026-03-09 · Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang

Compositional generalization, a key open challenge in modern machine learning, requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack…

Machine Learning · Computer Science · 2025-11-06 · Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi

In-context learning has shown great success in i.i.d. semantic parsing splits, where the training and test sets are drawn from the same distribution. In this setup, models are typically prompted with demonstrations that are similar to the…

Computation and Language · Computer Science · 2023-06-27 · Itay Levy, Ben Bogin, Jonathan Berant

Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-29 · Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

Sequence-to-sequence (seq2seq) models are prevalent in semantic parsing, but have been found to struggle at out-of-distribution compositional generalization. While specialized model architectures and pre-training of seq2seq models have been…

Computation and Language · Computer Science · 2021-04-16 · Jonathan Herzig, Peter Shaw, Ming-Wei Chang, Kelvin Guu, Panupong Pasupat, Yuan Zhang

Compositional generalization is a basic and essential intellective capability of human beings, which allows us to recombine known parts readily. However, existing neural network based models have been proven to be extremely deficient in…

Artificial Intelligence · Computer Science · 2020-10-27 · Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, Dongmei Zhang

Having access to multi-modal cues (e.g., vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-23 · Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

Compositional generalization, the ability of an agent to generalize to unseen combinations of latent factors, is easy for humans but hard for deep neural networks. A line of research in cognitive science has hypothesized a process,…

Machine Learning · Computer Science · 2023-10-31 · Yi Ren, Samuel Lavoie, Mikhail Galkin, Danica J. Sutherland, Aaron Courville

Multimodal deep learning systems which employ multiple modalities like text, image, audio, video, etc., are showing better performance in comparison with individual modalities (i.e., unimodal) systems. Multimodal machine learning involves…

Machine Learning · Computer Science · 2022-01-19 · Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha

As deep neural networks become more adept at traditional tasks, many of the most exciting new challenges concern multimodality: observations that combine diverse types, such as image and text. In this paper, we introduce a family of…

Machine Learning · Computer Science · 2019-12-12 · Mike Wu, Noah Goodman

In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-12 · Suyan Li, Fuxiang Huang, Lei Zhang

Single-image 3D shape reconstruction is an important and long-standing problem in computer vision. A plethora of existing works is constantly pushing the state-of-the-art performance in the deep learning era. However, there remains a much…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-23 · Songfang Han, Jiayuan Gu, Kaichun Mo, Li Yi, Siyu Hu, Xuejin Chen, Hao Su

Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they…

Machine Learning · Computer Science · 2025-07-29 · Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka