Related papers: KIEval: A Knowledge-grounded Interactive Evaluatio…

A Survey on Data Contamination for Large Language Models

Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data…

Computation and Language · Computer Science 2025-06-06 Yuxing Cheng, Yi Chang, Yuan Wu

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

We are currently in an era of fierce competition among various large language models (LLMs) continuously pushing the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging…

Computation and Language · Computer Science 2024-06-04 Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu

AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models

As Large Language Models (LLMs) are pre-trained on ultra-large-scale corpora, the problem of data contamination is becoming increasingly serious, and there is a risk that static evaluation benchmarks overestimate the performance of LLMs. To…

Computation and Language · Computer Science 2025-08-13 Yang Fan

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a…

Computation and Language · Computer Science 2026-04-16 Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Huayu Sha, Kexin Tan, Qiyuan Peng, Yue Zhang, Junzhe Wang, Shichun Liu, Yueyuan Huang, Jingqi Tong, Changhao Jiang, Yilong Wu, Zhihao Zhang, Mingqi Wu, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

Data Contamination Can Cross Language Barriers

The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text…

Computation and Language · Computer Science 2024-10-31 Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models

The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes…

Computation and Language · Computer Science 2025-09-19 Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu

MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive…

Computation and Language · Computer Science 2025-09-23 Raoyuan Zhao, Beiduo Chen, Barbara Plank, Michael A. Hedderich

Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM

The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination (unintentional memorization of benchmark data during model training) poses critical…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, Benyou Wang

A Comprehensive Survey of Contamination Detection Methods in Large Language Models

With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial…

Computation and Language · Computer Science 2025-07-11 Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation

Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination (test set exposure during training) risk masking true generalization. This concern extends to…

Artificial Intelligence · Computer Science 2025-06-10 Ming Liu, Wensheng Zhang

Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely…

Computation and Language · Computer Science 2026-03-31 Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by…

Computation and Language · Computer Science 2025-11-25 Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

Recent advancements in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models,…

Computation and Language · Computer Science 2024-04-16 Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi, Dehai Min

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and…

Computation and Language · Computer Science 2024-04-10 Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang

AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents

Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios. However, their capability in handling complex, multi-character social interactions has yet to be fully explored,…

Computation and Language · Computer Science 2024-03-06 Yuanzhi Liang, Linchao Zhu, Yi Yang

MedKGEval: A Knowledge Graph-Based Multi-Turn Evaluation Framework for Open-Ended Patient Interactions with Clinical LLMs

The reliable evaluation of large language models (LLMs) in medical applications remains an open challenge, particularly in capturing the complexity of multi-turn doctor-patient interactions that unfold in real clinical environments.…

Artificial Intelligence · Computer Science 2025-10-15 Yuechun Yu, Han Ying, Haoan Jin, Wenjian Jiang, Dong Xian, Binghao Wang, Zhou Yang, Mengyue Wu

DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature…

Artificial Intelligence · Computer Science 2024-03-15 Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly based on supervised signals, which depend on static datasets and…

Computation and Language · Computer Science 2023-09-11 Jiatong Li, Rui Li, Qi Liu

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs…

Computation and Language · Computer Science 2024-01-31 Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu

Knowledge Editing for Large Language Models: A Survey

Large language models (LLMs) have recently transformed both the academic and industrial landscapes due to their remarkable capacity to understand, analyze, and generate texts based on their vast knowledge and reasoning ability.…

Computation and Language · Computer Science 2024-09-23 Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, Jundong Li