Related papers: Multi-Modality Driven LoRA for Adverse Condition D…
Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor…
Integrating different representations from complementary sensing modalities is crucial for robust scene interpretation in autonomous driving. While deep learning architectures that fuse vision and range data for 2D object detection have…
Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt…
We focus on bridging domain discrepancy in lane detection among different scenarios to greatly reduce extra annotation and re-training costs for autonomous driving. Critical factors hinder the performance improvement of cross-domain lane…
Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging…
Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment across source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple…
Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for the unconstrained real-world scenarios with adverse…
The fusion of multimodal sensor data streams such as camera images and lidar point clouds plays an important role in the operation of autonomous vehicles (AVs). Robust perception across a range of adverse weather and lighting conditions is…
The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture…
We focus on improving the visual understanding capability for boosting the vision-language models. We propose \textbf{Arcana}, a multiModal language model, which introduces two crucial techniques. First, we present Multimodal LoRA…
High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in…
Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to…
All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them…
Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence -- a phenomenon termed ``text shortcut learning.'' We propose an…
Conditional inference on joint textual and visual clues is a multi-modal reasoning task that textual clues provide prior permutation or external knowledge, which are complementary with visual content and pivotal to deducing the correct…
Autonomous driving is challenging in adverse road and weather conditions in which there might not be lane lines, the road might be covered in snow and the visibility might be poor. We extend the previous work on end-to-end learning for…
General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them…
Loop closure detection (LCD) is the key module in appearance based simultaneously localization and mapping (SLAM). However, in the real life, the appearance of visual inputs are usually affected by the illumination changes and texture…
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for…
Powered by deep representation learning, reinforcement learning (RL) provides an end-to-end learning framework capable of solving self-driving (SD) tasks without manual designs. However, time-varying nonstationary environments cause…