Related papers: Learning Visual Actions Using Multiple Verb-Only L…

Towards an Unequivocal Representation of Actions

This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguishing differing ones (e.g. 'open door' vs 'open bottle') using verb-only…

Computer Vision and Pattern Recognition · Computer Science · 2018-05-11 · Michael Wray, Davide Moltisanti, Dima Damen

Unsupervised Learning of View-invariant Action Representations

The recent success in human action recognition with deep learning methods mostly adopts the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an…

Computer Vision and Pattern Recognition · Computer Science · 2018-09-07 · Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli

An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

Precisely naming the action depicted in a video can be a challenging and oftentimes ambiguous task. In contrast to object instances represented as nouns (e.g. dog, cat, chair, etc.), in the case of actions, human annotators typically lack a…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-12 · Kiyoon Kim, Davide Moltisanti, Oisin Mac Aodha, Laura Sevilla-Lara

Open Vocabulary Multi-Label Video Classification

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-14 · Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising…

Computer Vision and Pattern Recognition · Computer Science · 2017-04-24 · Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen

Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception

For training a video-based action recognition model that accepts multi-view video, annotating frame-level labels is tedious and difficult. However, it is relatively easy to annotate sequence-level labels. This kind of coarse annotations are…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-20 · Vijay John, Yasutomo Kawanishi

Action Selection Learning for Multi-label Multi-view Action Recognition

Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-21 · Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide

Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting

Since collecting and annotating data for spatio-temporal action detection is very expensive, there is a need to learn approaches with less supervision. Weakly supervised approaches do not require any bounding box annotations and can be…

Computer Vision and Pattern Recognition · Computer Science · 2021-01-22 · Sovan Biswas, Juergen Gall

Multiview Pseudo-Labeling for Semi-supervised Learning from Video

We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video. The complementary views help obtain…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-02 · Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction

We present an approach to labeling short video clips with English verbs as event descriptions. A key distinguishing aspect of this work is that it labels videos with verbs that describe the spatiotemporal interaction between event…

Computer Vision and Pattern Recognition · Computer Science · 2012-04-17 · Andrei Barbu, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-31 · Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning

This paper presents a novel approach to Single-Positive Multi-label Learning. In general multi-label learning, a model learns to predict multiple labels or categories for a single input image. This is in contrast with standard multi-class…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-25 · Xin Xing, Zhexiao Xiong, Abby Stylianou, Srikumar Sastry, Liyu Gong, Nathan Jacobs

Learning from Multiview Correlations in Open-Domain Videos

An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further…

Machine Learning · Computer Science · 2019-03-04 · Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

Action Modifiers: Learning from Adverbs in Instructional Videos

We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on…

Computer Vision and Pattern Recognition · Computer Science · 2020-03-25 · Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

Visual Semantic Role Labeling

In this paper we introduce the problem of Visual Semantic Role Labeling: given an image we want to detect people doing actions and localize the objects of interaction. Classical approaches to action recognition either study the task of…

Computer Vision and Pattern Recognition · Computer Science · 2015-05-19 · Saurabh Gupta, Jitendra Malik

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance…

Computer Vision and Pattern Recognition · Computer Science · 2018-07-10 · Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard

Probing Image-Language Transformers for Verb Understanding

Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their…

Computation and Language · Computer Science · 2021-06-18 · Lisa Anne Hendricks, Aida Nematzadeh

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-30 · Mohamed El Banani, Karan Desai, Justin Johnson

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science · 2017-11-10 · Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Learning Cross-lingual Visual Speech Representations

Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised…

Computation and Language · Computer Science · 2023-03-17 · Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic