Related papers: Measuring Massive Multitask Language Understanding

Measuring Massive Multitask Chinese Understanding

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test…

Computation and Language · Computer Science 2026-05-28 Hui Zeng

M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Large language models have recently made tremendous progress in a variety of aspects, e.g., cross-task generalization, instruction following. Comprehensively evaluating the capability of large language models in multiple tasks is of great…

Computation and Language · Computer Science 2023-05-23 Chuang Liu, Renren Jin, Yuqi Ren, Linhao Yu, Tianyu Dong, Xiaohan Peng, Shuting Zhang, Jianxiang Peng, Peiyi Zhang, Qingqing Lyu, Xiaowen Su, Qun Liu, Deyi Xiong

Measuring Social Norms of Large Language Models

We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. Our dataset features the largest set…

Computation and Language · Computer Science 2024-05-24 Ye Yuan, Kexin Tang, Jianhao Shen, Ming Zhang, Chenguang Wang

Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models

This study aims to explore the performance improvement method of large language models based on GPT-4 under the multi-task learning framework and conducts experiments on two tasks: text classification and automatic summary generation.…

Computation and Language · Computer Science 2024-12-10 Zhen Qi, Jiajing Chen, Shuo Wang, Bingying Liu, Hongye Zheng, Chihang Wang

Language Models are Few-Shot Learners

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires…

Computation and Language · Computer Science 2020-07-24 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Exploring the Benefits of Training Expert Language Models over Instruction Tuning

Recently, Language Models (LMs) instruction-tuned on multiple tasks, also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training…

Computation and Language · Computer Science 2023-02-10 Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo

Text Alignment Is An Efficient Unified Model for Massive NLP Tasks

Large language models (LLMs), typically designed as a function of next-word prediction, have excelled across extensive NLP tasks. Despite the generality, next-word prediction is often not an efficient formulation for many of the tasks,…

Computation and Language · Computer Science 2023-11-03 Yuheng Zha, Yichi Yang, Ruichen Li, Zhiting Hu

TruthfulQA: Measuring How Models Mimic Human Falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that…

Computation and Language · Computer Science 2022-05-10 Stephanie Lin, Jacob Hilton, Owain Evans

Language Model Behavior: A Comprehensive Survey

Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before…

Computation and Language · Computer Science 2023-08-29 Tyler A. Chang, Benjamin K. Bergen

Language Modelling as a Multi-Task Problem

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we…

Computation and Language · Computer Science 2021-01-28 Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes

Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses

At the staggering pace with which the capabilities of large language models (LLMs) are increasing, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel…

Computation and Language · Computer Science 2023-12-21 Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and…

Computation and Language · Computer Science 2025-09-30 Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou, Robert Lissitz

REBUS: A Robust Evaluation Benchmark of Understanding Symbols

We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and…

Computation and Language · Computer Science 2024-06-05 Andrew Gritsevskiy, Arjun Panickssery, Aaron Kirtland, Derik Kauffman, Hans Gundlach, Irina Gritsevskaya, Joe Cavanagh, Jonathan Chiang, Lydia La Roux, Michelle Hung

Estimating Large Language Model Capabilities without Labeled Test Data

Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples, but the success of ICL varies widely from task to task. Thus, it is important to quickly determine whether ICL is…

Computation and Language · Computer Science 2023-10-27 Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia

Modular Approach to Machine Reading Comprehension: Mixture of Task-Aware Experts

In this work we present a Mixture of Task-Aware Experts Network for Machine Reading Comprehension on a relatively small dataset. We particularly focus on the issue of common-sense learning, enforcing the common ground knowledge by…

Computation and Language · Computer Science 2022-10-05 Anirudha Rayasam, Anusha Kamath, Gabriel Bayomi Tinoco Kalejaiye

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are…

Computation and Language · Computer Science 2025-04-01 Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine…

Computation and Language · Computer Science 2018-06-12 Johannes Welbl, Pontus Stenetorp, Sebastian Riedel

Spotlights and Blindspots: Evaluating Machine-Generated Text Detection

With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure…

Computation and Language · Computer Science 2026-04-23 Kevin Stowe, Kailash Patil

Grammatical Templates: Improving Text Difficulty Evaluation for Language Learners

Language students are most engaged while reading texts at an appropriate difficulty level. However, existing methods of evaluating text difficulty focus mainly on vocabulary and do not prioritize grammatical features, hence they do not work…

Computation and Language · Computer Science 2017-02-17 Shuhan Wang, Erik Andersen

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they…

Computation and Language · Computer Science 2023-09-26 R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas L. Griffiths