Related papers: Weakly-Supervised Temporal Article Grounding

Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD).…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, Sangyoun Lee

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are…

Computer Vision and Pattern Recognition · Computer Science 2024-05-15 Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, Jian-Fang Hu

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically…

Computer Vision and Pattern Recognition · Computer Science 2020-01-28 Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong

A Survey on Temporal Sentence Grounding in Videos

Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-20 Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu

Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample,…

Computer Vision and Pattern Recognition · Computer Science 2025-05-13 Lu Dong, Haiyu Zhang, Hongjie Zhang, Yifei Huang, Zhen-Hua Ling, Yu Qiao, Limin Wang, Yali Wang

Boosting Temporal Sentence Grounding via Causal Inference

Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao

Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction

We study weakly-supervised video object grounding: given a video segment and a corresponding descriptive sentence, the goal is to localize objects that are mentioned from the sentence in the video. During training, no object bounding boxes…

Computer Vision and Pattern Recognition · Computer Science 2018-07-23 Luowei Zhou, Nathan Louis, Jason J. Corso

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these…

Computer Vision and Pattern Recognition · Computer Science 2024-10-18 Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, Joon Son Chung

UniVTG: Towards Unified Video-Language Temporal Grounding

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric

Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence which indicates complex human activities in a long and untrimmed video sequence, has received unprecedented attention over the last few years.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-23 Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, Wenwu Zhu

Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network

Given some video-query pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous respectable TSG methods have achieved remarkable success,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li

Towards Weakly Supervised Text-to-Audio Grounding

The text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised…

Sound · Computer Science 2024-07-18 Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

Weakly-Supervised Video Object Grounding via Causal Intervention

We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. It aims to localize objects described in the sentence to visual regions in the video,…

Computer Vision and Pattern Recognition · Computer Science 2021-12-02 Wei Wang, Junyu Gao, Changsheng Xu

Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding

Video Paragraph Grounding (VPG) aims to precisely locate the most appropriate moments within a video that are relevant to a given textual paragraph query. However, existing methods typically rely on large-scale annotated temporal labels and…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Mengzhao Wang, Huafeng Li, Yafei Zhang, Jinxing Li, Minghong Xie, Dapeng Tao

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we are dedicated to the…

Computer Vision and Pattern Recognition · Computer Science 2022-10-24 Yuechen Wang, Wengang Zhou, Houqiang Li

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for…

Computer Vision and Pattern Recognition · Computer Science 2020-03-17 Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu

TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding

Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Jin-Seop Lee, SungJoon Lee, Jaehan Ahn, YunSeok Choi, Jee-Hyong Lee

UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Joungbin An, Agrim Jain, Kristen Grauman

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision. Existing WS-TAL methods rely on deep features learned for action…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Ziyi Liu, Le Wang, Wei Tang, Junsong Yuan, Nanning Zheng, Gang Hua