Related papers: Context-Oriented Web Video Tag Recommendation
Video ads segmentation and tagging is a challenging task due to two main reasons: (1) the video scene structure is complex and (2) it includes multiple modalities (e.g., visual, audio, text.). While previous work focuses mostly on activity…
Nowadays, online video platforms mostly recommend related videos by analyzing user-driven data such as viewing patterns, rather than the content of the videos. However, content is more important than any other element when videos aim to…
Contextual Suggestion deals with search techniques for complex information needs that are highly focused on context and user needs. In this paper, we propose \emph{R-Rec}, a novel rule-based technique to identify and recommend appropriate…
Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of the video challenges the performance and efficiency of TVR. To handle the serialized video…
Music listening preferences at a given time depend on a wide range of contextual factors, such as user emotional state, location and activity at listening time, the day of the week, the time of the day, etc. It is therefore of great…
Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are…
With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various…
The last decade has witnessed the proliferation of micro-videos on various user-generated content platforms. According to our statistics, around 85.7\% of micro-videos lack annotation. In this paper, we focus on annotating micro-videos with…
Tagging of visual content is becoming more and more widespread as web-based services and social networks have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of…
Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient…
The large number of user-generated videos uploaded on to the Internet everyday has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos,…
Traditionally, recommender systems for the Web deal with applications that have two dimensions, users and items. Based on access logs that relate these dimensions, a recommendation model can be built and used to identify a set of N items…
Video Instance Segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. Most existing methods typically accomplish this task by employing a multi-stage top-down…
Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this…
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags…
Video text spotting is still an important research topic due to its various real-applications. Previous approaches usually fall into the four-staged pipeline: text detection in individual images, framewisely recognizing localized text…
The amount of digital video data is increasing over the world. It highlights the need for efficient algorithms that can index, retrieve and browse this data by content. This can be achieved by identifying semantic description captured…
The extraction of text information in videos serves as a critical step towards semantic understanding of videos. It usually involved in two steps: (1) text recognition and (2) text classification. To localize texts in videos, we can resort…
Human perception and experience of music is highly context-dependent. Contextual variability contributes to differences in how we interpret and interact with music, challenging the design of robust models for information retrieval.…
This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel…