Related papers: Exploiting LMM-based knowledge for image classific…

LMM-Regularized CLIP Embeddings for Image Classification

In this paper we deal with image classification tasks using the powerful CLIP vision-language model. Our goal is to advance the classification performance using the CLIP's image encoder, by proposing a novel Large Multimodal Model (LMM)…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Maria Tzelepi, Vasileios Mezaris

Disturbing Image Detection Using LMM-Elicited Emotion Embeddings

In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic…

Computer Vision and Pattern Recognition · Computer Science 2024-06-19 Maria Tzelepi, Vasileios Mezaris

Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

Recently, there has been a surge in the popularity of pre-trained large language models (LLMs) (such as GPT-4), sweeping across the entire Natural Language Processing (NLP) and Computer Vision (CV) communities. These LLMs have demonstrated…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Shuxiao Ma, Linyuan Wang, Senbao Hou, Bin Yan

PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures

The Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs involves employing an…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Tianxiang Wu, Minxin Nie, Ziqiang Cao

Generating Images with Multimodal Language Models

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image…

Computation and Language · Computer Science 2023-10-16 Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification

(Renyi Qu's Master's Thesis) Recent advancements in interpretable models for vision-language tasks have achieved competitive performance; however, their interpretability often suffers due to the reliance on unstructured text outputs from…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Renyi Qu, Mark Yatskar

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-27 Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well?…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Kengo Nakata, Daisuke Miyashita, Youyang Ng, Yasuto Hoshi, Jun Deguchi

LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression

Supported by powerful generative models, low-bitrate learned image compression (LIC) models utilizing perceptual metrics have become feasible. Some of the most advanced models achieve high compression rates and superior perceptual quality…

Image and Video Processing · Electrical Eng. & Systems 2024-11-21 Shimon Murai, Heming Sun, Jiro Katto

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Image Captioning, or the automatic generation of descriptions for images, is one of the core problems in Computer Vision and has seen considerable progress using Deep Learning Techniques. We propose to use Inception-ResNet Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Sulabh Katiyar, Samir Kumar Borgohain

Probing Multimodal Large Language Models for Global and Local Semantic Representations

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

Rethinking VLMs and LLMs for Image Classification

Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable…

Machine Learning · Computer Science 2024-10-22 Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen

DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of noteworthy contributions in recent months. The prevailing trend involves adopting data-driven methodologies, wherein diverse…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Xin He, Longhui Wei, Lingxi Xie, Qi Tian

Does VLM Classification Benefit from LLM Description Semantics?

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without…

Artificial Intelligence · Computer Science 2024-12-17 Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen

Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-24 Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang