Related papers: Embodied Multimodal Multitask Learning

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations.…

Computation and Language · Computer Science 2019-10-02 Po-Yao Huang , Xiaojun Chang , Alexander Hauptmann

Instruction-Following Agents with Multimodal Transformer

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Hao Liu , Lisa Lee , Kimin Lee , Pieter Abbeel

Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training

Language grounding aims at linking the symbolic representation of language (e.g., words) into the rich perceptual knowledge of the outside world. The general approach is to embed both textual and visual information into a common space -the…

Computation and Language · Computer Science 2021-09-15 Hassan Shahmohammadi , Hendrik P. A. Lensch , R. Harald Baayen

A Visual Attention Grounding Neural Model for Multimodal Machine Translation

We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual…

Computation and Language · Computer Science 2018-08-29 Mingyang Zhou , Runxiang Cheng , Yong Jae Lee , Zhou Yu

Imagination improves Multimodal Translation

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded…

Computation and Language · Computer Science 2017-07-10 Desmond Elliott , Ákos Kádár

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle…

Machine Learning · Computer Science 2023-11-08 Georgios Pantazopoulos , Malvina Nikandrou , Amit Parekh , Bhathiya Hemanthage , Arash Eshghi , Ioannis Konstas , Verena Rieser , Oliver Lemon , Alessandro Suglia

Disentanglement for audio-visual emotion recognition using multitask setup

Deep learning models trained on audio-visual data have been successfully used to achieve state-of-the-art performance for emotion recognition. In particular, models trained with multitask learning have shown additional performance…

Image and Video Processing · Electrical Eng. & Systems 2021-02-15 Raghuveer Peri , Srinivas Parthasarathy , Charles Bradshaw , Shiva Sundaram

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance.…

Computer Vision and Pattern Recognition · Computer Science 2022-11-01 Fenglin Liu , Xian Wu , Shen Ge , Xuancheng Ren , Wei Fan , Xu Sun , Yuexian Zou

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it…

Computer Vision and Pattern Recognition · Computer Science 2017-10-17 Tanmay Gupta , Kevin Shih , Saurabh Singh , Derek Hoiem

Learning to Represent Bilingual Dictionaries

Bilingual word embeddings have been widely used to capture the similarity of lexical semantics in different human languages. However, many applications, such as cross-lingual semantic search and question answering, can be largely benefited…

Computation and Language · Computer Science 2019-09-10 Muhao Chen , Yingtao Tian , Haochen Chen , Kai-Wei Chang , Steven Skiena , Carlo Zaniolo

Unsupervised Multimodal Language Representations using Convolutional Autoencoders

Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. During the last years, several works have been proposed in the…

Computation and Language · Computer Science 2022-01-10 Panagiotis Koromilas , Theodoros Giannakopoulos

Multi-mapping Image-to-Image Translation via Learning Disentanglement

Recent advances of image-to-image translation focus on learning the one-to-many mapping from two aspects: multi-modal translation and multi-domain translation. However, the existing methods only consider one of the two perspectives, which…

Computer Vision and Pattern Recognition · Computer Science 2019-12-30 Xiaoming Yu , Yuanqi Chen , Thomas Li , Shan Liu , Ge Li

Learning deep representation of multityped objects and tasks

We introduce a deep multitask architecture to integrate multityped representations of multimodal objects. This multitype exposition is less abstract than the multimodal characterization, but more machine-friendly, and thus is more precise…

Machine Learning · Statistics 2016-03-07 Truyen Tran , Dinh Phung , Svetha Venkatesh

Probing Contextualized Sentence Representations with Visual Awareness

We present a universal framework to model contextualized sentence representations with visual awareness that is motivated to overcome the shortcomings of the multimodal parallel data with manual annotations. For each sentence, we first…

Computation and Language · Computer Science 2019-11-12 Zhuosheng Zhang , Rui Wang , Kehai Chen , Masao Utiyama , Eiichiro Sumita , Hai Zhao

Embodied Visual Active Learning for Semantic Segmentation

We study the task of embodied visual active learning, where an agent is set to explore a 3d environment with the goal to acquire visual scene understanding by actively selecting views for which to request annotation. While accurate on some…

Computer Vision and Pattern Recognition · Computer Science 2020-12-18 David Nilsson , Aleksis Pirinen , Erik Gärtner , Cristian Sminchisescu

Learning Multimodal Word Representation via Dynamic Fusion Methods

Multimodal models have been proven to outperform text-based models on learning semantic word representations. Almost all previous multimodal models typically treat the representations from different modalities equally. However, it is…

Computation and Language · Computer Science 2018-01-03 Shaonan Wang , Jiajun Zhang , Chengqing Zong

Multimodal Neurons in Pretrained Text-Only Transformers

Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text…

Computer Vision and Pattern Recognition · Computer Science 2023-10-03 Sarah Schwettmann , Neil Chowdhury , Samuel Klein , David Bau , Antonio Torralba

BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models

Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer…

Artificial Intelligence · Computer Science 2025-12-05 Yu-Wei Zhan , Xin Wang , Pengzhe Mao , Tongtong Feng , Ren Wang , Wenwu Zhu

Show, Translate and Tell

Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. Recent success of deep neural networks has enabled us to develop algorithms which give machines the…

Computer Vision and Pattern Recognition · Computer Science 2019-03-18 Dheeraj Peri , Shagan Sah , Raymond Ptucha

Tied Multitask Learning for Neural Speech Translation

We explore multitask models for neural translation of speech, augmenting them in order to reflect two intuitive notions. First, we introduce a model where the second task decoder receives information from the decoder of the first task,…

Computation and Language · Computer Science 2018-04-27 Antonios Anastasopoulos , David Chiang