Related papers: Self-Supervised Learning from Web Data for Multimo…

Learning to Learn from Web Data through Deep Semantic Embeddings

In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval. We…

Computer Vision and Pattern Recognition · Computer Science 2018-08-21 Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across…

Multimedia · Computer Science 2018-08-24 Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury

Self-supervised learning of visual features through embedding images into text topic spaces

End-to-end training from scratch of current deep architectures for new computer vision problems would require Imagenet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of…

Computer Vision and Pattern Recognition · Computer Science 2017-05-25 Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper…

Computer Vision and Pattern Recognition · Computer Science 2021-10-18 Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces

The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such…

Computer Vision and Pattern Recognition · Computer Science 2018-07-09 Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

Self-Supervised Image-to-Text and Text-to-Image Synthesis

A comprehensive understanding of vision and language and their interrelation are crucial to realize the underlying similarities and differences between these modalities and to learn more generalized, meaningful representations. In recent…

Computer Vision and Pattern Recognition · Computer Science 2021-12-10 Anindya Sundar Das, Sriparna Saha

Learning Multilingual Word Embeddings Using Image-Text Data

There has been significant interest recently in learning multilingual word embeddings -- in which semantically similar words across languages have similar embeddings. State-of-the-art approaches have relied on expensive labeled data, which…

Computation and Language · Computer Science 2020-07-02 Karan Singhal, Karthik Raman, Balder ten Cate

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is a challenging task since the features and representations of text and image are…

Information Retrieval · Computer Science 2020-02-28 Hadi Abdi Khojasteh, Ebrahim Ansari, Parvin Razzaghi, Akbar Karimi

Learning Social Image Embedding with Deep Multimodal Attention Networks

Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain…

Multimedia · Computer Science 2017-10-19 Feiran Huang, Xiaoming Zhang, Zhoujun Li, Tao Mei, Yueying He, Zhonghua Zhao

Learning semantic sentence representations from visually grounded language without lexical knowledge

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word…

Computation and Language · Computer Science 2019-09-25 Danny Merkx, Stefan Frank

Learning Robust Visual-Semantic Embeddings

Many of the existing methods for learning joint embedding of images and text use only supervised information from paired images and its textual attributes. Taking advantage of the recent success of unsupervised learning in deep neural…

Computer Vision and Pattern Recognition · Computer Science 2017-03-21 Yao-Hung Hubert Tsai, Liang-Kang Huang, Ruslan Salakhutdinov

Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. We present a novel content-independent…

Information Retrieval · Computer Science 2019-06-12 Karan Sikka, Lucas Van Bramer, Ajay Divakaran

Multi-Modality Deep Network for Extreme Learned Image Compression

Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To…

Image and Video Processing · Electrical Eng. & Systems 2023-04-27 Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, Liquan Shen

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text…

Multimedia · Computer Science 2019-06-21 Christian Otto, Matthias Springstein, Avishek Anand, Ralph Ewerth

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms without the use of linguistic…

Computation and Language · Computer Science 2018-04-10 David Harwath, Galen Chuang, James Glass

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or…

Computer Vision and Pattern Recognition · Computer Science 2024-05-14 Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie

Multimodal Semi-Supervised Learning for Text Recognition

Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the…

Computer Vision and Pattern Recognition · Computer Science 2022-05-10 Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman

Learning the semantic structure of objects from Web supervision

While recent research in image understanding has often focused on recognizing more types of objects, understanding more about the objects is just as important. Recognizing object parts and attributes has been extensively studied before, yet…

Computer Vision and Pattern Recognition · Computer Science 2021-12-03 David Novotny, Diane Larlus, Andrea Vedaldi

Self-Supervised Visual Representations for Cross-Modal Retrieval

Cross-modal retrieval methods have been significantly improved in last years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a…

Computer Vision and Pattern Recognition · Computer Science 2019-02-04 Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using…

Computation and Language · Computer Science 2021-12-28 Mikel Artetxe, Gorka Labaka, Eneko Agirre