Related papers: Self-Supervised Visual Preference Alignment

VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Recent advances in video-large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or human-annotated captions to generate preference…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Yogesh Kulkarni, Pooyan Fazli

Synth-Align: Improving Trustworthiness in Vision-Language Model with Synthetic Preference Data Alignment

Large Vision-Language Models (LVLMs) have shown promising capabilities in understanding and generating information by integrating both visual and textual data. However, current models are still prone to hallucinations, which degrade the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-13 Robert Wijaya, Ngoc-Bao Nguyen, Ngai-Man Cheung

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using…

Computer Vision and Pattern Recognition · Computer Science 2025-02-03 Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Murun Yang, Qiaozhi He, Tong Xiao, Chunliang Zhang, Tongran Liu, Quan Du, Di Yang, Jingbo Zhu

SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner

Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability, which steers the model behavior via preference fine-tuning on preference data structured as "image - winner text - loser text" triplets.…

Computer Vision and Pattern Recognition · Computer Science 2025-03-10 Kejia Chen, Jiawen Zhang, Jiacong Hu, Jiazhen Yang, Jian Lou, Zunlei Feng, Mingli Song

Improving Large Vision and Language Models by Learning from a Panel of Peers

Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Jefferson Hernandez, Jing Shi, Simon Jenni, Vicente Ordonez, Kushal Kafle

Probing Visual Language Priors in VLMs

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning

Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates…

Machine Learning · Computer Science 2025-02-04 Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury

Feedback-Driven Vision-Language Alignment with Minimal Human Supervision

Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, Aleix Martinez

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and…

Computation and Language · Computer Science 2024-06-18 Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

VaPR -- Vision-language Preference alignment for Reasoning

Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the…

Artificial Intelligence · Computer Science 2025-10-03 Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components…

Machine Learning · Computer Science 2024-02-20 Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs

Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang

Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on…

Machine Learning · Computer Science 2024-12-23 Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong

SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation

We present SelfPrompt, a novel prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. Existing methods for tuning VLMs in semi-supervised setups struggle with the negative impact of the miscalibrated…

Computer Vision and Pattern Recognition · Computer Science 2025-01-30 Shuvendu Roy, Ali Etemad

Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, Zhengzhong Tu

Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection

Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following large language models (LLMs), but it is still a new and unexplored research area for vision-language models…

Computation and Language · Computer Science 2024-02-21 Ruibo Chen, Yihan Wu, Lichang Chen, Guodong Liu, Qi He, Tianyi Xiong, Chenxi Liu, Junfeng Guo, Heng Huang

Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models

The development of large vision-language models (LVLMs) offers the potential to address challenges faced by traditional multimodal recommendations thanks to their proficient understanding of static images and textual dynamics. However, the…

Artificial Intelligence · Computer Science 2024-02-14 Yuqing Liu, Yu Wang, Lichao Sun, Philip S. Yu

Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition

In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated…

Computer Vision and Pattern Recognition · Computer Science 2025-05-15 Muzammil Behzad

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang