Related papers: Exploring and Analyzing Machine Commonsense Benchm…

Benchmarks for Automated Commonsense Reasoning: A Survey

More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense…

Artificial Intelligence · Computer Science 2023-02-24 Ernest Davis

What Really is Commonsense Knowledge?

Commonsense datasets have been well developed in Natural Language Processing, mainly through crowdsource human annotation. However, there are debates on the genuineness of commonsense reasoning benchmarks. In specific, a significant portion…

Computation and Language · Computer Science 2024-11-07 Quyet V. Do, Junze Li, Tung-Duong Vuong, Zhaowei Wang, Yangqiu Song, Xiaojuan Ma

Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation

A fundamental ability of humans is to utilize commonsense knowledge in language understanding and question answering. In recent years, many knowledge-enhanced Commonsense Question Answering (CQA) approaches have been proposed. However, it…

Computation and Language · Computer Science 2021-01-06 Ning Bian, Xianpei Han, Bo Chen, Le Sun

A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and in relatively fewer cases, generative) question-answering…

Computation and Language · Computer Science 2022-07-18 Henrique Santos, Ke Shen, Alice M. Mulvehill, Yasaman Razeghi, Deborah L. McGuinness, Mayank Kejriwal

Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering

Non-extractive commonsense QA remains a challenging AI task, as it requires systems to reason about, synthesize, and gather disparate pieces of information, in order to generate responses to queries. Recent approaches on such tasks show…

Computation and Language · Computer Science 2019-11-01 Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, Alessandro Oltramari

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks typically measure the ability of LLMs to respond to individual…

Computation and Language · Computer Science 2026-01-30 Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen

MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have…

Computation and Language · Computer Science 2025-04-22 Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes

mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans

It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most…

Computation and Language · Computer Science 2024-06-07 Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Benchmarks as Microscopes: A Call for Model Metrology

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their…

Software Engineering · Computer Science 2024-07-31 Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI). Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.…

Computation and Language · Computer Science 2021-06-03 Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-Lin Wu, Xuezhe Ma, Nanyun Peng

Quantum Computer Benchmarking: An Explorative Systematic Literature Review

As quantum computing (QC) continues to evolve in hardware and software, measuring progress in this complex and diverse field remains a challenge. To track progress, uncover bottlenecks, and evaluate community efforts, benchmarks play a…

Quantum Physics · Physics 2025-09-04 Tobias Rohe, Federico Harjes Ruiloba, Sabrina Egger, Sebastian von Beck, Jonas Stein, Claudia Linnhoff-Popien

An MLCommons Scientific Benchmarks Ontology

Scientific machine learning research spans diverse domains and data modalities, yet existing benchmark efforts remain siloed and lack standardization. This makes novel and transformative applications of machine learning to critical…

Machine Learning · Computer Science 2025-11-11 Ben Hawks, Gregor von Laszewski, Matthew D. Sinclair, Marco Colombo, Shivaram Venkataraman, Rutwik Jain, Yiwei Jiang, Nhan Tran, Geoffrey Fox

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little…

Computation and Language · Computer Science 2019-03-19 Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant

Semantic Categorization of Social Knowledge for Commonsense Question Answering

Large pre-trained language models (PLMs) have led to great success on various commonsense question answering (QA) tasks in an end-to-end fashion. However, little attention has been paid to what commonsense knowledge is needed to deeply…

Computation and Language · Computer Science 2021-09-14 Gengyu Wang, Xiaochen Hou, Diyi Yang, Kathleen McKeown, Jing Huang

BenCSSmark: Making the Social Sciences Count in LLM Research

This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks -- standardized tools for assessing…

Computation and Language · Computer Science 2026-05-07 Arnault Chatelain, Étienne Ollion, Qianwen Guan, Diandra Fabre, Lorraine Goeuriot, Emile Chapuis, Abdelkrim Beloued, Marie Candito, Nicolas Hervé, Didier Schwab

CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs

Recently, there has been an increase in the number of knowledge graphs that can be only queried by experts. However, describing questions using structured queries is not straightforward for non-expert users who need to have sufficient…

Computation and Language · Computer Science 2021-05-04 Abdelghny Orogat, Isabelle Liu, Ahmed El-Roby

Fusing Context Into Knowledge Graph for Commonsense Question Answering

Commonsense question answering (QA) requires a model to grasp commonsense and factual knowledge to answer questions about world events. Many prior methods couple language modeling with knowledge graphs (KG). However, although a KG contains…

Computation and Language · Computer Science 2021-08-04 Yichong Xu, Chenguang Zhu, Ruochen Xu, Yang Liu, Michael Zeng, Xuedong Huang

PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison

The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark…

Machine Learning · Computer Science 2017-03-03 Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, Jason H. Moore

mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly…

Computation and Language · Computer Science 2025-08-15 Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

CIKQA: Learning Commonsense Inference with a Unified Knowledge-in-the-loop QA Paradigm

Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, inference capability, or both? We argue that due to…

Computation and Language · Computer Science 2022-10-13 Hongming Zhang, Yintong Huo, Yanai Elazar, Yangqiu Song, Yoav Goldberg, Dan Roth