Related papers

Related papers: KIEval: A Knowledge-grounded Interactive Evaluatio…

200 papers

Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data…

Computation and Language · Computer Science · 2025-06-06 · Yuxing Cheng, Yi Chang, Yuan Wu

We are currently in an era of fierce competition among various large language models (LLMs) continuously pushing the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging…

Computation and Language · Computer Science · 2024-06-04 · Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu

As Large Language Models (LLMs) are pre-trained on ultra-large-scale corpora, the problem of data contamination is becoming increasingly serious, and there is a risk that static evaluation benchmarks overestimate the performance of LLMs. To…

Computation and Language · Computer Science · 2025-08-13 · Yang Fan

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a…

The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text…

Computation and Language · Computer Science · 2024-10-31 · Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes…

Computation and Language · Computer Science · 2025-09-19 · Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive…

Computation and Language · Computer Science · 2025-09-23 · Raoyuan Zhao, Beiduo Chen, Barbara Plank, Michael A. Hedderich

The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination (unintentional memorization of benchmark data during model training) poses critical…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, Benyou Wang

With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial…

Computation and Language · Computer Science · 2025-07-11 · Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination (test set exposure during training) risk masking true generalization. This concern extends to…

Artificial Intelligence · Computer Science · 2025-06-10 · Ming Liu, Wensheng Zhang

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely…

Computation and Language · Computer Science · 2026-03-31 · Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by…

Computation and Language · Computer Science · 2025-11-25 · Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu

Recent advancements in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models,…

Computation and Language · Computer Science · 2024-04-16 · Yu Li, Shenyu Zhang, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi, Dehai Min

The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and…

Computation and Language · Computer Science · 2024-04-10 · Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang

Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios. However, their capability in handling complex, multi-character social interactions has yet to be fully explored,…

Computation and Language · Computer Science · 2024-03-06 · Yuanzhi Liang, Linchao Zhu, Yi Yang

The reliable evaluation of large language models (LLMs) in medical applications remains an open challenge, particularly in capturing the complexity of multi-turn doctor-patient interactions that unfold in real clinical environments.…

Artificial Intelligence · Computer Science · 2025-10-15 · Yuechun Yu, Han Ying, Haoan Jin, Wenjian Jiang, Dong Xian, Binghao Wang, Zhou Yang, Mengyue Wu

Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature…

Artificial Intelligence · Computer Science · 2024-03-15 · Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly supervised signal-based, relying on static datasets and…

Computation and Language · Computer Science · 2023-09-11 · Jiatong Li, Rui Li, Qi Liu

Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs…

Computation and Language · Computer Science · 2024-01-31 · Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu

Large language models (LLMs) have recently transformed both the academic and industrial landscapes due to their remarkable capacity to understand, analyze, and generate texts based on their vast knowledge and reasoning ability.…

Computation and Language · Computer Science · 2024-09-23 · Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, Jundong Li