Related papers: Conceptual Codebook Learning for Vision-Language M…

Cross-Modal Concept Learning and Inference for Vision-Language Models

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models

Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

Concept-Guided Prompt Learning for Generalization in Vision-Language Models

The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved…

Computer Vision and Pattern Recognition · Computer Science 2023-12-25 Jitesh Jain, Jianwei Yang, Humphrey Shi

ConceptCoder: Improve Code Reasoning via Concept Learning

Large language models (LLMs) have shown promising results for software engineering applications, but still struggle with code reasoning tasks such as vulnerability detection (VD). We introduce ConceptCoder, a fine-tuning method that…

Software Engineering · Computer Science 2026-03-25 Md Mahbubur Rahman, Hengbo Tong, Wei Le

Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models

Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks, leading to the development of various efficient transfer learning methods.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Dexia Chen, Wentao Zhang, Qianjie Zhu, Ping Hu, Weibing Li, Tong Zhang, Ruixuan Wang

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based…

Computation and Language · Computer Science 2026-04-29 Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling which often misses both macro-level events and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is…

Computation and Language · Computer Science 2025-05-20 Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Language modality within the vision language pretraining framework is innately discretized, endowing each word in the language vocabulary a semantic meaning. In contrast, visual modality is inherently continuous and high-dimensional, which…

Computer Vision and Pattern Recognition · Computer Science 2022-08-02 Xiaoyuan Guo, Jiali Duan, C.-C. Jay Kuo, Judy Wawira Gichoya, Imon Banerjee

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Yiyang Huang, Yizhou Wang, Yun Fu

Towards Multimodal In-Context Learning for Vision & Language Models

State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM)…

Computer Vision and Pattern Recognition · Computer Science 2024-07-18 Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

CoPL: Contextual Prompt Learning for Vision-Language Understanding

Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by…

Computer Vision and Pattern Recognition · Computer Science 2023-12-13 Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan

CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in…

Computation and Language · Computer Science 2022-07-12 Gabriel Skantze, Bram Willemsen

OLIVE: Object Level In-Context Visual Embeddings

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Timothy Ossowski, Junjie Hu

CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Min Zhang, Bo Jiang, Jie Zhou, Yimeng Liu, Xin Lin