Related papers: Driver Activity Classification Using Generalizable…

Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos

Recognizing the activities causing distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar

Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous…

Computer Vision and Pattern Recognition · Computer Science 2026-02-19 Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi

Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito, Taro Watanabe

INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation

Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Terry Yang

Boosting Real-Time Driving Scene Parsing with Shared Semantics

Real-time scene parsing is a fundamental feature for autonomous driving vehicles with multiple cameras. In this letter we demonstrate that sharing semantics between cameras with different perspectives and overlapped views can boost the…

Computer Vision and Pattern Recognition · Computer Science 2020-01-14 Zhenzhen Xiang, Anbo Bao, Jie Li, Jianbo Su

Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. The gap remains in applying VLMs to understand complex scenarios interacting with pedestrians and efficient…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Haoxiang Gao, Li Zhang, Yu Zhao, Zhou Yang, Jinghan Cao

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua

Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by…

Robotics · Computer Science 2025-05-28 Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas

Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models…

Computer Vision and Pattern Recognition · Computer Science 2025-08-01 Utkarsh Shandilya, Marsha Mariya Kappan, Sanyam Jain, Vijeta Sharma

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that…

Computer Vision and Pattern Recognition · Computer Science 2022-07-18 Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Steerable Visual Representations

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model

Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling…

Artificial Intelligence · Computer Science 2026-03-04 Xiangyu Li, Tianyi Wang, Xi Cheng, Rakesh Chowdary Machineni, Zhaomiao Guo, Sikai Chen, Junfeng Jiao, Christian Claudel

VLP: Vision Language Planning for Autonomous Driving

Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance, through enhanced…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren

Representing visual classification as a linear combination of words

Explainability is a longstanding challenge in deep learning, especially in high-stakes domains like healthcare. Common explainability methods highlight image regions that drive an AI model's decision. Humans, however, heavily rely on…

Artificial Intelligence · Computer Science 2023-11-21 Shobhit Agarwal, Yevgeniy R. Semenov, William Lotter

LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving

Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Nan Song, Bozhou Zhang, Xiatian Zhu, Jiankang Deng, Li Zhang

Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles

Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Mahdi Rezaei, Mohsen Azarmi

Visuomotor Understanding for Representation Learning of Driving Scenes

Dashboard cameras capture a tremendous amount of driving scene video each day. These videos are purposefully coupled with vehicle sensing data, such as from the speedometer and inertial sensors, providing an additional sensing modality for…

Computer Vision and Pattern Recognition · Computer Science 2019-09-17 Seokju Lee, Junsik Kim, Tae-Hyun Oh, Yongseop Jeong, Donggeun Yoo, Stephen Lin, In So Kweon

A Survey on Vision-Language-Action Models for Autonomous Driving

The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun