Related papers: Deep Multimodal Feature Encoding for Video Orderin…

Multi-modal Transformer for Video Retrieval

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit…

Computer Vision and Pattern Recognition · Computer Science 2020-07-22 Valentin Gabeur , Chen Sun , Karteek Alahari , Cordelia Schmid

Multi-modal Representation Learning for Video Advertisement Content Structuring

Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style. Different from real-life videos, video advertisements contain…

Computer Vision and Pattern Recognition · Computer Science 2021-09-15 Daya Guo , Zhaoyang Zeng

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

The abundance of instructional videos and their narrations over the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn video representation that encodes both action steps and their…

Computer Vision and Pattern Recognition · Computer Science 2023-04-03 Yiwu Zhong , Licheng Yu , Yang Bai , Shangwen Li , Xueting Yan , Yin Li

MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

Learning visual feature representations for video analysis is a daunting task that requires a large amount of training samples and a proper generalization framework. Many of the current state of the art methods for video captioning and…

Machine Learning · Computer Science 2018-09-20 Oliver Nina , Washington Garcia , Scott Clouse , Alper Yilmaz

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying…

Computation and Language · Computer Science 2023-12-05 Keito Kudo , Haruki Nagasawa , Jun Suzuki , Nobuyuki Shimizu

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation.…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 G. Thomas Hudson , Dean Slack , Thomas Winterbottom , Jamie Sterling , Chenghao Xiao , Junjie Shentu , Noura Al Moubayed

Learning Temporal Embeddings for Complex Video Analysis

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are…

Computer Vision and Pattern Recognition · Computer Science 2015-05-05 Vignesh Ramanathan , Kevin Tang , Greg Mori , Li Fei-Fei

Masked Motion Encoding for Self-Supervised Video Representation Learning

How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Xinyu Sun , Peihao Chen , Liangwei Chen , Changhao Li , Thomas H. Li , Mingkui Tan , Chuang Gan

Deep video representation learning: a survey

This paper provides a review on representation learning for videos. We classify recent spatiotemporal feature learning methods for sequential visual data and compare their pros and cons for general video analysis. Building effective…

Computer Vision and Pattern Recognition · Computer Science 2024-05-13 Elham Ravanbakhsh , Yongqing Liang , J. Ramanujam , Xin Li

Classification of Important Segments in Educational Videos using Multimodal Features

Videos are a commonly-used type of content in learning during Web search. Many e-learning platforms provide quality content, but sometimes educational videos are long and cover many topics. Humans are good in extracting important sections…

Computer Vision and Pattern Recognition · Computer Science 2020-10-27 Junaid Ahmed Ghauri , Sherzod Hakimov , Ralph Ewerth

A Multimodal Framework for Video Ads Understanding

There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively. Taking the 2021 TAAC competition as an…

Computer Vision and Pattern Recognition · Computer Science 2021-08-31 Zejia Weng , Lingchen Meng , Rui Wang , Zuxuan Wu , Yu-Gang Jiang

A Comprehensive Study of Deep Video Action Recognition

Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered…

Computer Vision and Pattern Recognition · Computer Science 2020-12-14 Yi Zhu , Xinyu Li , Chunhui Liu , Mohammadreza Zolfaghari , Yuanjun Xiong , Chongruo Wu , Zhi Zhang , Joseph Tighe , R. Manmatha , Mu Li

Overview of Tencent Multi-modal Ads Video Understanding Challenge

Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos. Our challenge includes two tasks: video structuring in the temporal dimension and multi-modal video classification.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-17 Zhenzhi Wang , Liyu Wu , Zhimin Li , Jiangfeng Xiong , Qinglin Lu

Multimodal Long Video Modeling Based on Temporal Dynamic Context

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Haoran Hao , Jiaming Han , Yiyuan Zhang , Xiangyu Yue

Progressive Video Summarization via Multimodal Self-supervised Learning

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, easily leading to over-fitting of the deep…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Li Haopeng , Ke Qiuhong , Gong Mingming , Tom Drummond

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from an untrimmed video and producing textual description (captions) for each localized event. Most of the previous works in dense video captioning are solely based on visual…

Computer Vision and Pattern Recognition · Computer Science 2020-05-07 Vladimir Iashin , Esa Rahtu

Video Time: Properties, Encoders and Evaluation

Time-aware encoding of frame sequences in a video is a fundamental problem in video understanding. While many attempted to model time in videos, an explicit study on quantifying video time is missing. To fill this lacuna, we aim to evaluate…

Computer Vision and Pattern Recognition · Computer Science 2018-07-19 Amir Ghodrati , Efstratios Gavves , Cees G. M. Snoek

End-to-end Multi-modal Video Temporal Grounding

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as…

Computer Vision and Pattern Recognition · Computer Science 2021-11-01 Yi-Wen Chen , Yi-Hsuan Tsai , Ming-Hsuan Yang

Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition

With the explosive growth of video data in real-world applications, a comprehensive representation of videos becomes increasingly important. In this paper, we address the problem of video scene recognition, whose goal is to learn a…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Xuzheng Yu , Chen Jiang , Wei Zhang , Tian Gan , Linlin Chao , Jianan Zhao , Yuan Cheng , Qingpei Guo , Wei Chu

Exploiting Temporal Coherence for Multi-modal Video Categorization

Multimodal ML models can process data in multiple modalities (e.g., video, images, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding). In this paper, we focus on the…

Computer Vision and Pattern Recognition · Computer Science 2020-06-09 Palash Goyal , Saurabh Sahu , Shalini Ghosh , Chul Lee