Related papers: Learning Visual N-Grams from Web Data

Vision-Language Models for Vision Tasks: A Survey

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition…

Computer Vision and Pattern Recognition · Computer Science 2024-02-19 Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu

Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any…

Computer Vision and Pattern Recognition · Computer Science 2021-03-02 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Open-World Visual Recognition Using Knowledge Graphs

In a real-world setting, visual recognition systems can be brought to make predictions for images belonging to previously unknown class labels. In order to make semantically meaningful predictions for such inputs, we propose a two-step…

Machine Learning · Computer Science 2017-08-29 Vincent P. A. Lonij, Ambrish Rawat, Maria-Irina Nicolae

Visually grounded few-shot word acquisition with fewer shots

We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word.…

Computation and Language · Computer Science 2023-05-31 Leanne Nortje, Benjamin van Niekerk, Herman Kamper

Learning Visual Features from Large Weakly Supervised Data

Convolutional networks trained on large supervised dataset produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger…

Computer Vision and Pattern Recognition · Computer Science 2015-11-10 Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

Combining Language and Vision with a Multimodal Skip-gram Model

We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM) build vector-based word representations by learning to predict linguistic contexts in…

Computation and Language · Computer Science 2015-03-13 Angeliki Lazaridou, Nghia The Pham, Marco Baroni

Learning Deep Representations of Fine-grained Visual Descriptions

State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded…

Computer Vision and Pattern Recognition · Computer Science 2016-05-19 Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee

Learning to Name Classes for Vision and Language Models

Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content. Two distinct challenges that remain however, are high sensitivity to the choice of…

Computer Vision and Pattern Recognition · Computer Science 2023-04-05 Sarah Parisot, Yongxin Yang, Steven McDonagh

Perceptual Grouping in Contrastive Vision-Language Models

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently…

Computer Vision and Pattern Recognition · Computer Science 2015-10-05 Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

Using n-grams models for visual semantic place recognition

The aim of this paper is to present a new method for visual place recognition. Our system combines global image characterization and visual words, which allows to use efficient Bayesian filtering methods to integrate several images. More…

Machine Learning · Statistics 2014-03-24 Mathieu Dubois, Frenoux Emmanuelle, Philippe Tarroux

Towards Learning a Vocabulary of Visual Concepts and Operators using Deep Neural Networks

Deep neural networks have become the default choice for many applications like image and video recognition, segmentation and other image and video related tasks. However, a critical challenge with these models is the lack of…

Computer Vision and Pattern Recognition · Computer Science 2021-09-02 Sunil Kumar Vengalil, Neelam Sinha

Extracting Visual Knowledge from the Internet: Making Sense of Image Data

Recent successes in visual recognition can be primarily attributed to feature representation, learning algorithms, and the ever-increasing size of labeled training data. Extensive research has been devoted to the first two, but much less…

Computer Vision and Pattern Recognition · Computer Science 2019-06-10 Yazhou Yao, Jian Zhang, Xiansheng Hua, Fumin Shen, Zhenmin Tang

From Captions to Visual Concepts and Back

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to…

Computer Vision and Pattern Recognition · Computer Science 2016-02-22 Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

Learning to Represent Image and Text with Denotation Graph

Learning to fuse vision and language information and representing them is an important research problem with many applications. Recent progresses have leveraged the ideas of pre-training (from language modeling) and attention layers in…

Computer Vision and Pattern Recognition · Computer Science 2020-10-08 Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha

A Vision Check-up for Language Models

What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba

Visual Referring Expression Recognition: What Do Systems Actually Learn?

We present an empirical analysis of the state-of-the-art systems for referring expression recognition -- the task of identifying the object in an image referred to by a natural language expression -- with the goal of gaining insight into…

Computation and Language · Computer Science 2018-05-31 Volkan Cirik, Louis-Philippe Morency, Taylor Berg-Kirkpatrick

Visually grounded learning of keyword prediction from untranscribed speech

During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful…

Computation and Language · Computer Science 2017-05-29 Herman Kamper, Shane Settle, Gregory Shakhnarovich, Karen Livescu

Language learning using Speech to Image retrieval

Humans learn language by interaction with their environment and listening to other humans. It should also be possible for computational models to learn language directly from speech but so far most approaches require text. We improve on…

Computation and Language · Computer Science 2019-09-25 Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of…

Machine Learning · Computer Science 2015-09-28 Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov