Related papers: Do Language Models Understand Time?

200 papers

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Ali Rasekh , Erfan Bagheri Soula , Omid Daliran , Simon Gottschalk , Mohsen Fayyaz

Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Ruyang Liu , Chen Li , Haoran Tang , Yixiao Ge , Ying Shan , Ge Li

The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual…

Computer Vision and Pattern Recognition · Computer Science 2024-12-04 Heqing Zou , Tianze Luo , Guiyang Xie , Victor Zhang , Fengmao Lv , Guangcong Wang , Junyang Chen , Zhuochen Wang , Hansheng Zhang , Huaijian Zhang

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Fawad Javed Fateh , Umer Ahmed , Hamza Khan , M. Zeeshan Zia , Quoc-Huy Tran

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs)…

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Bin Huang , Xin Wang , Hong Chen , Zihan Song , Wenwu Zhu

Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Yuetian Weng , Mingfei Han , Haoyu He , Xiaojun Chang , Bohan Zhuang

Recent years have witnessed outstanding advances of large vision-language models (LVLMs). In order to tackle video understanding, most of them depend upon their implicit temporal understanding capacity. As such, they have not deciphered…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Thong Nguyen , Zhiyuan Hu , Xu Lin , Cong-Duy Nguyen , See-Kiong Ng , Luu Anh Tuan

Recently, there has been a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Yuanxin Liu , Shicheng Li , Yi Liu , Yuxiang Wang , Shuhuai Ren , Lei Li , Sishuo Chen , Xu Sun , Lu Hou

With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of…

Computer Vision and Pattern Recognition · Computer Science 2023-05-24 Guo Chen , Yin-Dong Zheng , Jiahao Wang , Jilan Xu , Yifei Huang , Junting Pan , Yi Wang , Yali Wang , Yu Qiao , Tong Lu , Limin Wang

Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image-…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 George Lydakis , Alexander Hermans , Ali Athar , Daan de Geus , Bastian Leibe

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel…

Computer Vision and Pattern Recognition · Computer Science 2025-08-22 Haibo Wang , Zhiyang Xu , Yu Cheng , Shizhe Diao , Yufan Zhou , Yixin Cao , Qifan Wang , Weifeng Ge , Lifu Huang

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large…

Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Shimin Chen , Xiaohan Lan , Yitian Yuan , Zequn Jie , Lin Ma

Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural…

Machine Learning · Computer Science 2024-12-05 Minghao Shao , Abdul Basit , Ramesh Karri , Muhammad Shafique

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Bo Feng , Zhengfeng Lai , Shiyu Li , Zizhen Wang , Simon Wang , Ping Huang , Meng Cao

Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Junbin Xiao , Nanxin Huang , Hangyu Qin , Dongyang Li , Yicong Li , Fengbin Zhu , Zhulin Tao , Jianxing Yu , Liang Lin , Tat-Seng Chua , Angela Yao

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Bingzheng QU , Kehai Chen , Xuefeng Bai , Min Zhang

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Minjoon Jung , Junbin Xiao , Byoung-Tak Zhang , Angela Yao

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Muhammad Maaz , Hanoona Rasheed , Salman Khan , Fahad Khan