Related papers: Large Language Models Encode Clinical Knowledge

Towards Expert-Level Medical Question Answering with Large Language Models

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to…

Computation and Language · Computer Science · 2023-05-17 · Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan

MedExQA: Medical Question Answering Benchmark with Multiple Explanations

This paper introduces MedExQA, a novel benchmark in medical question-answering, to evaluate large language models' (LLMs) understanding of medical knowledge through explanations. By constructing datasets across five distinct medical…

Computation and Language · Computer Science · 2024-07-04 · Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Honghan Wu

PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine,…

Computation and Language · Computer Science · 2026-05-27 · Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Clinical problem-solving requires processing of semantic medical knowledge such as illness scripts and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising…

Computation and Language · Computer Science · 2024-07-25 · Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E. Abdulnour

LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and…

Computation and Language · Computer Science · 2025-09-03 · Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

PMC-LLaMA: Towards Building Open-source Language Models for Medicine

Recently, Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. While demonstrating proficiency in everyday conversations and question-answering situations, these models frequently struggle…

Computation and Language · Computer Science · 2023-08-28 · Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation

The paper introduces a framework for the evaluation of the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from…

Computation and Language · Computer Science · 2024-10-21 · Magdalena Wysocka, Oskar Wysocki, Maxime Delmas, Vincent Mutel, Andre Freitas

LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models

Accurate and efficient question-answering systems are essential for delivering high-quality patient care in the medical field. While Large Language Models (LLMs) have made remarkable strides across various domains, they continue to face…

Computation and Language · Computer Science · 2025-01-22 · Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, Xin Wang

A Benchmark for Long-Form Medical Question Answering

There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable,…

Computation and Language · Computer Science · 2024-11-21 · Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour

Large Language Models in the Clinic: A Comprehensive Benchmark

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical…

Computation and Language · Computer Science · 2024-10-17 · Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that…

Computation and Language · Computer Science · 2024-06-07 · Anand Subramanian, Viktor Schlegel, Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Vijay Prakash Dwivedi, Stefan Winkler

Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models

In response to the pressing need for advanced clinical problem-solving tools in healthcare, we introduce BooksMed, a novel framework based on a Large Language Model (LLM). BooksMed uniquely emulates human cognitive processes to deliver…

Computation and Language · Computer Science · 2023-10-18 · Khushboo Verma, Marina Moore, Stephanie Wottrich, Karla Robles López, Nishant Aggarwal, Zeel Bhatt, Aagamjit Singh, Bradford Unroe, Salah Basheer, Nitish Sachdeva, Prinka Arora, Harmanjeet Kaur, Tanupreet Kaur, Tevon Hood, Anahi Marquez, Tushar Varshney, Nanfu Deng, Azaan Ramani, Pawanraj Ishwara, Maimoona Saeed, Tatiana López Velarde Peña, Bryan Barksdale, Sushovan Guha, Satwant Kumar

Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage,…

Computation and Language · Computer Science · 2025-05-23 · Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He

MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances…

Computation and Language · Computer Science · 2024-11-12 · Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge

Large language models (LLMs) have excelled across domains, also delivering notable performance on the medical evaluation benchmarks, such as MedQA. However, there still exists a significant gap between the reported performance and the…

Computation and Language · Computer Science · 2024-06-06 · Yuxuan Zhou, Xien Liu, Chen Ning, Ji Wu

Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators

Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine…

Computation and Language · Computer Science · 2025-03-25 · Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning.…

Computation and Language · Computer Science · 2024-07-02 · Nikhil Khandekar, Qiao Jin, Guangzhi Xiong, Soren Dunn, Serina S Applebaum, Zain Anwar, Maame Sarfo-Gyamfi, Conrad W Safranek, Abid A Anwar, Andrew Zhang, Aidan Gilson, Maxwell B Singer, Amisha Dave, Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering

In recent years, Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in…

Computation and Language · Computer Science · 2024-06-11 · Juraj Vladika, Phillip Schneider, Florian Matthes

Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese

Large Language Models (LLMs) have demonstrated remarkable success in diverse natural language processing (NLP) tasks in general domains. However, LLMs sometimes generate responses with the hallucination about medical facts due to limited…

Computation and Language · Computer Science · 2025-01-14 · Haochun Wang, Sendong Zhao, Zewen Qiang, Zijian Li, Nuwa Xi, Yanrui Du, MuZhen Cai, Haoqiang Guo, Yuhan Chen, Haoming Xu, Bing Qin, Ting Liu

Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a…

Computation and Language · Computer Science · 2025-06-11 · Yuxuan Zhou, Xien Liu, Chenwei Yan, Chen Ning, Xiao Zhang, Boxun Li, Xiangling Fu, Shijin Wang, Guoping Hu, Yu Wang, Ji Wu