Related papers: Task Me Anything

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-06 · Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan

M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their…

Computation and Language · Computer Science · 2024-08-27 · Florian Schneider, Sunayana Sitaram

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-24 · Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna

Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks

This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as…

Computation and Language · Computer Science · 2025-07-21 · Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance…

Artificial Intelligence · Computer Science · 2026-04-20 · Anshul Kumar, Gagan Raj Gupta, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-25 · Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models

Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single…

Software Engineering · Computer Science · 2025-10-27 · Florian Tambon, Amin Nikanjam, Cyrine Zid, Foutse Khomh, Giuliano Antoniol

Judge Anything: MLLM as a Judge Across Any Modality

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal…

Computation and Language · Computer Science · 2025-03-25 · Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu

SEED-Bench-2: Benchmarking Multimodal Large Language Models

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-30 · Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-11 · Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li

MMR: Evaluating Reading Ability of Large Multimodal Models

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images. Most existing text-rich image benchmarks are simple extraction-based question answering, and many…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-28 · Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan

Towards a Benchmark for Large Language Models for Business Process Management Tasks

An increasing number of organizations are deploying Large Language Models (LLMs) for a wide range of tasks. Despite their general utility, LLMs are prone to errors, ranging from inaccuracies to hallucinations. To objectively assess the…

Artificial Intelligence · Computer Science · 2024-10-15 · Kiran Busch, Henrik Leopold

NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their abilities to recognize objects modified by specific attributes remain an open question. To address this, we explore MLLMs' reasoning…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-28 · Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, Hideki Nakayama

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-14 · Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, Xiongkuo Min

LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-11 · Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization

Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition…

Artificial Intelligence · Computer Science · 2025-09-12 · Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei

Cost-Efficient Estimation of General Abilities Across Benchmarks

Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities.…

Computation and Language · Computer Science · 2026-04-03 · Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-27 · Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang