Related papers: Do Language Models Understand Time?

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz

ST-LLM: Large Language Models Are Effective Temporal Learners

Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual…

Computer Vision and Pattern Recognition · Computer Science 2024-12-04 Heqing Zou, Tianze Luo, Guiyang Xie, Victor, Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran

Video Understanding with Large Language Models: A Survey

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs)…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Yolo Y. Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

VTimeLLM: Empower LLM to Grasp Video Moments

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

LongVLM: Efficient Long Video Understanding via Large Language Models

Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang

Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding

Recent years have witnessed outstanding advances of large vision-language models (LVLMs). In order to tackle video understanding, most of them depend upon their implicit temporal understanding capacity. As such, they have not deciphered…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Thong Nguyen, Zhiyuan Hu, Xu Lin, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

TempCompass: Do Video LLMs Really Understand Videos?

Recently, there has been a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou

VideoLLM: Modeling Video Sequence with Large Language Models

With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of…

Computer Vision and Pattern Recognition · Computer Science 2023-05-24 Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang

How Important are Videos for Training Video LLMs?

Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image-…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel…

Computer Vision and Pattern Recognition · Computer Science 2025-08-22 Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Yolo Y. Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural…

Machine Learning · Computer Science 2024-12-05 Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao

VideoQA in the Era of LLMs: An Empirical Study

Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao

Empowering Video Translation using Multimodal Large Language Models

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang

On the Consistency of Video Large Language Models in Temporal Comprehension

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan