Related papers: Exploiting LMM-based knowledge for image classific…

200 papers

In this paper we deal with image classification tasks using the powerful CLIP vision-language model. Our goal is to advance the classification performance using the CLIP's image encoder, by proposing a novel Large Multimodal Model (LMM)…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Maria Tzelepi, Vasileios Mezaris

In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic…

Computer Vision and Pattern Recognition · Computer Science 2024-06-19 Maria Tzelepi, Vasileios Mezaris

Recently, there has been a surge in the popularity of pre-trained large language models (LLMs) (such as GPT-4), sweeping across the entire Natural Language Processing (NLP) and Computer Vision (CV) communities. These LLMs have demonstrated…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Shuxiao Ma, Linyuan Wang, Senbao Hou, Bin Yan

The Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs involves employing an…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Tianxiang Wu, Minxin Nie, Ziqiang Cao

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image…

Computation and Language · Computer Science 2023-10-16 Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

(Renyi Qu's Master's Thesis) Recent advancements in interpretable models for vision-language tasks have achieved competitive performance; however, their interpretability often suffers due to the reliance on unstructured text outputs from…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Renyi Qu, Mark Yatskar

Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-27 Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go

Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well?…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Kengo Nakata, Daisuke Miyashita, Youyang Ng, Yasuto Hoshi, Jun Deguchi

Supported by powerful generative models, low-bitrate learned image compression (LIC) models utilizing perceptual metrics have become feasible. Some of the most advanced models achieve high compression rates and superior perceptual quality…

Image and Video Processing · Electrical Eng. & Systems 2024-11-21 Shimon Murai, Heming Sun, Jiro Katto

Image Captioning, or the automatic generation of descriptions for images, is one of the core problems in Computer Vision and has seen considerable progress using Deep Learning Techniques. We propose to use Inception-ResNet Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Sulabh Katiyar, Samir Kumar Borgohain

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable…

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen

Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of noteworthy contributions in recent months. The prevailing trend involves adopting data-driven methodologies, wherein diverse…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Xin He, Longhui Wei, Lingxi Xie, Qi Tian

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without…

Artificial Intelligence · Computer Science 2024-12-17 Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen

With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-24 Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang