Related papers: DevBench: A multimodal developmental benchmark for…

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-13 · Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science · 2026-05-19 · Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts…

Multimedia · Computer Science · 2024-04-26 · Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

MMBench: Is Your Multi-modal Model an All-around Player?

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-21 · Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-10 · Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-24 · Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

MANBench: Is Your Multimodal Model Smarter than Human?

The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a…

Computation and Language · Computer Science · 2025-06-16 · Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-01 · Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and…

Computation and Language · Computer Science · 2023-10-16 · Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-19 · Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng

Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-07 · Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, Tingting Jiang

CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning

While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-07 · Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao

Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior

While Multimodal Large Language Models (MLLMs) excel at many vision tasks, it is unknown if they exhibit human-like perceptual behaviors. To evaluate this, we introduce HVSBench, the first large-scale benchmark with over 85,000 samples…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-18 · Jiaying Lin, Shuquan Ye, Dan Xu, Wanli Ouyang, Rynson W. H. Lau

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen

MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding

The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however,…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Purui Bai, Tao Wu, Jiayang Sun, Xinyue Liu, Huaibo Huang, Ran He

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed…

Computation and Language · Computer Science · 2024-04-10 · Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-14 · Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao

Benchmarking Deflection and Hallucination in Large Vision-Language Models

Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections…

Computation and Language · Computer Science · 2026-04-15 · Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias

Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models

Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable aligning with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-21 · Michihiro Yasunaga, Luke Zettlemoyer, Marjan Ghazvininejad

Scalable Performance Analysis for Vision-Language Models

Joint vision-language models have shown great performance over a diverse set of tasks. However, little is known about their limitations, as the high dimensional space learned by these models makes it difficult to identify semantic errors.…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-01 · Santiago Castro, Oana Ignat, Rada Mihalcea