Related papers: TaskCLIP: Extend Large Vision-Language Model for T…

Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have…

Computer Vision and Pattern Recognition · Computer Science 2025-05-08 Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

In this paper, we introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of contrastive learning-based vision-language models, particularly CLIP, in handling detail-oriented and fine-grained tasks like segmentation. While…

Computer Vision and Pattern Recognition · Computer Science 2025-04-02 Amin Karimi Monsefi, Kishore Prakash Sailaja, Ali Alilooee, Ser-Nam Lim, Rajiv Ramnath

ECOR: Explainable CLIP for Object Recognition

Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. Their open vocabulary feature enhances their value. However, their…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Ali Rasekh, Sepehr Kazemi Ranjbar, Milad Heidari, Wolfgang Nejdl

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen

Is CLIP the main roadblock for fine-grained open-world perception?

Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving,…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Fabrizio Falchi

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang

DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection

Open-world object detection, as a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu

DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition

Continual learning of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

TAB: Text-Align Anomaly Backbone Model for Industrial Inspection Tasks

In recent years, the focus on anomaly detection and localization in industrial inspection tasks has intensified. While existing studies have demonstrated impressive outcomes, they often rely heavily on extensive training datasets or robust…

Computer Vision and Pattern Recognition · Computer Science 2023-12-18 Ho-Weng Lee, Shang-Hong Lai

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of…

Computer Vision and Pattern Recognition · Computer Science 2023-05-26 Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai

Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee

Open Vocabulary Multi-Label Video Classification

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning…

Computer Vision and Pattern Recognition · Computer Science 2023-09-14 M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

Object-oriented embodied navigation aims to locate specific objects, defined by category or depicted in images. Existing methods often struggle to generalize to open vocabulary goals without extensive training data. While recent advances in…

Robotics · Computer Science 2024-07-15 Meng Wei, Tai Wang, Yilun Chen, Hanqing Wang, Jiangmiao Pang, Xihui Liu

Open-Vocabulary Camouflaged Object Segmentation

Recently, the emergence of large-scale vision-language models (VLMs), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLM for the challenging open-vocabulary…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Youwei Pang, Xiaoqi Zhao, Jiaming Zuo, Lihe Zhang, Huchuan Lu

TULIP: Towards Unified Language-Image Pretraining

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-04-09 Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning.…

Computer Vision and Pattern Recognition · Computer Science 2022-08-09 Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan Duan

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks,…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin