Related papers: Conceptual Codebook Learning for Vision-Language M…

200 papers

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He

Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved…

Computer Vision and Pattern Recognition · Computer Science 2023-12-25 Jitesh Jain, Jianwei Yang, Humphrey Shi

Large language models (LLMs) have shown promising results for software engineering applications, but still struggle with code reasoning tasks such as vulnerability detection (VD). We introduce ConceptCoder, a fine-tuning method that…

Software Engineering · Computer Science 2026-03-25 Md Mahbubur Rahman, Hengbo Tong, Wei Le

Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks, leading to the development of various efficient transfer learning methods.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Dexia Chen, Wentao Zhang, Qianjie Zhu, Ping Hu, Weibing Li, Tong Zhang, Ruixuan Wang

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based…

Computation and Language · Computer Science 2026-04-29 Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling, which often misses both macro-level events and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is…

Computation and Language · Computer Science 2025-05-20 Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia

The language modality within the vision-language pretraining framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which…

Computer Vision and Pattern Recognition · Computer Science 2022-08-02 Xiaoyuan Guo, Jiali Duan, C.-C. Jay Kuo, Judy Wawira Gichoya, Imon Banerjee

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Yiyang Huang, Yizhou Wang, Yun Fu

State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM)…

Computer Vision and Pattern Recognition · Computer Science 2024-07-18 Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by…

Computer Vision and Pattern Recognition · Computer Science 2023-12-13 Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in…

Computation and Language · Computer Science 2022-07-12 Gabriel Skantze, Bram Willemsen

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Timothy Ossowski, Junjie Hu

Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Min Zhang, Bo Jiang, Jie Zhou, Yimeng Liu, Xin Lin