Related papers: Learnable Pooling Methods for Video Classification

Hierarchical Deep Recurrent Architecture for Video Understanding

This paper introduces the system we developed for the Youtube-8M Video Understanding Challenge, in which a large-scale benchmark dataset was used for multi-label video classification. The proposed framework contains hierarchical deep…

Computer Vision and Pattern Recognition · Computer Science 2017-07-12 Luming Tang, Boyang Deng, Haiyu Zhao, Shuai Yi

Learnable pooling with Context Gating for video classification

Current methods for video analysis often extract frame-level features using pre-trained convolutional neural networks (CNNs). Such features are then aggregated over time e.g., by simple temporal averaging or more sophisticated recurrent…

Computer Vision and Pattern Recognition · Computer Science 2018-03-06 Antoine Miech, Ivan Laptev, Josef Sivic

Aggregating Frame-level Features for Large-Scale Video Classification

This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered as a multi-label classification problem defined on top of the large scale YouTube-8M Dataset. We employ a…

Computer Vision and Pattern Recognition · Computer Science 2017-07-05 Shaoxiang Chen, Xi Wang, Yongyi Tang, Xinpeng Chen, Zuxuan Wu, Yu-Gang Jiang

Large-Scale YouTube-8M Video Understanding with Deep Neural Networks

Video classification problem has been studied many years. The success of Convolutional Neural Networks (CNN) in image recognition tasks gives a powerful incentive for researchers to create more advanced video classification approaches. As…

Computer Vision and Pattern Recognition · Computer Science 2017-06-15 Manuk Akopyan, Eshsou Khashba

Cross-modal Embeddings for Video and Audio Retrieval

The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large scale datasets of videos such as the YouTube-8M allows us to deal with this large amount of data in…

Information Retrieval · Computer Science 2018-01-09 Didac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

Rank Pooling for Action Recognition

We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the…

Computer Vision and Pattern Recognition · Computer Science 2016-05-17 Basura Fernando, Efstratios Gavves, Jose Oramas, Amir Ghodrati, Tinne Tuytelaars

A novel learning-based frame pooling method for Event Detection

Detecting complex events in a large video collection crawled from video websites is a challenging task. When applying directly good image-based feature representation, e.g., HOG, SIFT, to videos, we have to face the problem of how to pool…

Computer Vision and Pattern Recognition · Computer Science 2016-08-22 Lan Wang, Chenqiang Gao, Jiang Liu, Deyu Meng

Multi-attention Networks for Temporal Localization of Video-level Labels

Temporal localization remains an important challenge in video understanding. In this work, we present our solution to the 3rd YouTube-8M Video Understanding Challenge organized by Google Research. Participants were required to build a…

Computer Vision and Pattern Recognition · Computer Science 2019-11-19 Lijun Zhang, Srinath Nizampatnam, Ahana Gangopadhyay, Marcos V. Conde

AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos

We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient…

Computer Vision and Pattern Recognition · Computer Science 2017-06-27 Amlan Kar, Nishant Rai, Karan Sikka, Gaurav Sharma

Deep Learning Methods for Efficient Large Scale Video Labeling

We present a solution to "Google Cloud and YouTube-8M Video Understanding Challenge" that ranked 5th place. The proposed model is an ensemble of three model families, two frame level and one video level. The training was performed on…

Machine Learning · Statistics 2017-06-15 Miha Skalic, Marcin Pekalski, Xingguo E. Pan

Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge that ranked the 3rd place. Because the challenge provides pre-extracted visual and audio features instead of…

Computer Vision and Pattern Recognition · Computer Science 2017-07-17 Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie Zhou, Shilei Wen

Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain

Video Representation Learning Using Discriminative Pooling

Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the…

Computer Vision and Pattern Recognition · Computer Science 2018-04-02 Jue Wang, Anoop Cherian, Fatih Porikli, Stephen Gould

Deep Architectures and Ensembles for Semantic Video Classification

This work addresses the problem of accurate semantic labelling of short videos. To this end, a multitude of different deep nets, ranging from traditional recurrent neural networks (LSTM, GRU), temporal agnostic networks (FV,VLAD,BoW), fully…

Computer Vision and Pattern Recognition · Computer Science 2018-10-09 Eng-Jon Ong, Sameed Husain, Mikel Bober-Irizar, Miroslaw Bober

Video Panels for Long Video Understanding

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Lars Doorenbos, Federico Spurio, Juergen Gall

Attentional Pooling for Action Recognition

We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks. Our proposed attention module can be trained with or without extra supervision, and gives a sizable…

Computer Vision and Pattern Recognition · Computer Science 2018-01-03 Rohit Girdhar, Deva Ramanan

Learnable Pooling Regions for Image Classification

Biologically inspired, from the early HMAX model to Spatial Pyramid Matching, pooling has played an important role in visual recognition pipelines. Spatial pooling, by grouping of local codes, equips these methods with a certain degree of…

Computer Vision and Pattern Recognition · Computer Science 2015-05-06 Mateusz Malinowski, Mario Fritz

The YouTube-8M Kaggle Competition: Challenges and Methods

We took part in the YouTube-8M Video Understanding Challenge hosted on Kaggle, and achieved the 10th place within less than one month's time. In this paper, we present an extensive analysis and solution to the underlying machine-learning…

Computer Vision and Pattern Recognition · Computer Science 2017-07-14 Haosheng Zou, Kun Xu, Jialian Li, Jun Zhu

YouTube-8M Video Understanding Challenge Approach and Applications

This paper introduces the YouTube-8M Video Understanding Challenge hosted as a Kaggle competition and also describes my approach to experimenting with various models. For each of my experiments, I provide the score result as well as…

Machine Learning · Statistics 2017-06-27 Edward Chen

Knowledge Distillation for Efficient Audio-Visual Video Captioning

Automatically describing audio-visual content with texts, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-19 Özkan Çaylı, Xubo Liu, Volkan Kılıç, Wenwu Wang