Related papers: Humanly Certifying Superhuman Classifiers

Evaluating Superhuman Models with Consistency Checks

If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this…

Machine Learning · Computer Science 2023-10-20 Lukas Fluri, Daniel Paleka, Florian Tramèr

How Accurate Does It Feel? -- Human Perception of Different Types of Classification Mistakes

Supervised machine learning utilizes large datasets, often with ground truth labels annotated by humans. While some data points are easy to classify, others are hard to classify, which reduces the inter-annotator agreement. This causes…

Human-Computer Interaction · Computer Science 2023-02-14 Andrea Papenmeier, Dagmar Kern, Daniel Hienert, Yvonne Kammerer, Christin Seifert

Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations

Human-annotated labels and explanations are critical for training explainable NLP models. However, unlike human-annotated labels whose quality is easier to calibrate (e.g., with a majority vote), human-crafted free-form explanations can be…

Computation and Language · Computer Science 2023-05-23 Bingsheng Yao, Prithviraj Sen, Lucian Popa, James Hendler, Dakuo Wang

On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection

Humans are the final decision makers in critical tasks that involve ethical and legal concerns, ranging from recidivism prediction, to medical diagnosis, to fighting against fake news. Although machine learning models can sometimes achieve…

Artificial Intelligence · Computer Science 2019-01-10 Vivian Lai, Chenhao Tan

HumanAL: Calibrating Human Matching Beyond a Single Task

This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators which is used as a feature representation of the provided input. We show that by…

Databases · Computer Science 2022-05-09 Roee Shraga

Learning in Repeated Games: Human Versus Machine

While Artificial Intelligence has successfully outperformed humans in complex combinatorial games (such as chess and checkers), humans have retained their supremacy in social interactions that require intuition and adaptation, such as…

Computers and Society · Computer Science 2014-04-22 Fatimah Ishowo-Oloko, Jacob Crandall, Manuel Cebrian, Sherief Abdallah, Iyad Rahwan

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer…

Computation and Language · Computer Science 2025-06-25 Lorenzo Proietti, Stefano Perrella, Roberto Navigli

Assessing Human Error Against a Benchmark of Perfection

An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has…

Artificial Intelligence · Computer Science 2016-06-17 Ashton Anderson, Jon Kleinberg, Sendhil Mullainathan

Human-AI Complementarity: A Goal for Amplified Oversight

Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can…

Artificial Intelligence · Computer Science 2025-10-31 Rishub Jain, Sophie Bridgers, Lili Janzer, Rory Greig, Tian Huey Teh, Vladimir Mikulik

Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written text is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding…

Computation and Language · Computer Science 2026-04-30 Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

Human-Algorithm Collaboration: Achieving Complementarity and Avoiding Unfairness

Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the…

Computers and Society · Computer Science 2022-06-02 Kate Donahue, Alexandra Chouldechova, Krishnaram Kenthapadi

Human Perception of Performance

Humans are routinely asked to evaluate the performance of other individuals, separating success from failure and affecting outcomes from science to education and sports. Yet, in many contexts, the metrics driving the human evaluation…

Physics and Society · Physics 2017-12-07 Luca Pappalardo, Paolo Cintia, Dino Pedreschi, Fosca Giannotti, Albert-Laszlo Barabasi

Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications

Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the…

Information Retrieval · Computer Science 2025-07-02 Leila Tavakoli, Hamed Zamani

Human-Like Navigation Behavior: A Statistical Evaluation Framework

Recent advancements in deep reinforcement learning have brought forth an impressive display of highly skilled artificial agents capable of complex intelligent behavior. In video games, these artificial agents are increasingly deployed as…

Machine Learning · Statistics 2022-03-14 Ian Colbert, Mehdi Saeedi

Do Human Rationales Improve Machine Explanations?

Work on "learning with rationales" shows that humans providing explanations to a machine learning system can improve the system's predictive accuracy. However, this work has not been connected to work in "explainable AI" which concerns…

Computation and Language · Computer Science 2019-06-03 Julia Strout, Ye Zhang, Raymond J. Mooney

On Benchmarking Human-Like Intelligence in Machines

Recent benchmark studies have claimed that AI has approached or even surpassed human-level performances on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing…

Artificial Intelligence · Computer Science 2025-03-03 Lance Ying, Katherine M. Collins, Lionel Wong, Ilia Sucholutsky, Ryan Liu, Adrian Weller, Tianmin Shu, Thomas L. Griffiths, Joshua B. Tenenbaum

To Trust, or Not to Trust? A Study of Human Bias in Automated Video Interview Assessments

Supervised systems require human labels for training. But are humans themselves always impartial during the annotation process? We examine this question in the context of automated assessment of human behavioral tasks. Specifically, we…

Human-Computer Interaction · Computer Science 2019-12-02 Chee Wee Leong, Katrina Roohr, Vikram Ramanarayanan, Michelle P. Martin-Raugh, Harrison Kell, Rutuja Ubale, Yao Qian, Zydrune Mladineo, Laura McCulla

An Interactive Human-Machine Learning Interface for Collecting and Learning from Complex Annotations

Human-Computer Interaction has been shown to lead to improvements in machine learning systems by boosting model performance, accelerating learning and building user confidence. In this work, we aim to alleviate the expectation that human…

Machine Learning · Computer Science 2024-03-29 Jonathan Erskine, Matt Clifford, Alexander Hepburn, Raúl Santos-Rodríguez

Automatable Evaluation Method Oriented toward Behaviour Believability for Video Games

Classic evaluation methods of believable agents are time-consuming because they involve many humans to judge agents. They are well suited to validate work on new believable behaviour models. However, during the implementation, numerous…

Artificial Intelligence · Computer Science 2010-09-03 Fabien Tencé, Cédric Buche

Challenging common interpretability assumptions in feature attribution explanations

As machine learning and algorithmic decision making systems are increasingly being leveraged in high-stakes human-in-the-loop settings, there is a pressing need to understand the rationale of their predictions. Researchers have responded to…

Machine Learning · Computer Science 2020-12-07 Jonathan Dinu, Jeffrey Bigham, J. Zico Kolter