Related papers: tasksource: A Dataset Harmonization Framework for …

Datasets: A Community Library for Natural Language Processing

The scale, variety, and quantity of publicly-available NLP datasets have grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this…

Computation and Language · Computer Science · 2021-09-08 · Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, Thomas Wolf

A Data-Centric Framework for Composable NLP Workflows

Empirical natural language processing (NLP) systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion, human annotation, to text retrieval, analysis,…

Computation and Language · Computer Science · 2021-09-03 · Zhengzhong Liu, Guanxiong Ding, Avinash Bukkittu, Mansi Gupta, Pengzhi Gao, Atif Ahmed, Shikun Zhang, Xin Gao, Swapnil Singhavi, Linwei Li, Wei Wei, Zecong Hu, Haoran Shi, Haoying Zhang, Xiaodan Liang, Teruko Mitamura, Eric P. Xing, Zhiting Hu

Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards

Developing documentation guidelines and easy-to-use templates for datasets and models is a challenging task, especially given the variety of backgrounds, skills, and incentives of the people involved in the building of natural language…

Databases · Computer Science · 2021-08-18 · Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi, Sebastian Gehrmann, Yacine Jernite

HugNLP: A Unified and Comprehensive Library for Natural Language Processing

In this paper, we introduce HugNLP, a unified and comprehensive library for natural language processing (NLP) with the prevalent backend of HuggingFace Transformers, which is designed for NLP researchers to easily utilize off-the-shelf…

Computation and Language · Computer Science · 2023-03-01 · Jianing Wang, Nuo Chen, Qiushi Sun, Wenkang Huang, Chengyu Wang, Ming Gao

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding…

Machine Learning · Computer Science · 2024-01-26 · Xinyu Yang, Weixin Liang, James Zou

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data - a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the…

Computation and Language · Computer Science · 2022-11-08 · Selim Fekih, Nicolò Tamagnone, Benjamin Minixhofer, Ranjan Shrestha, Ximena Contla, Ewan Oglethorpe, Navid Rekabsaz

Using a thousand optimization tasks to learn hyperparameter search strategies

We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional…

Machine Learning · Computer Science · 2020-04-02 · Luke Metz, Niru Maheswaranathan, Ruoxi Sun, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

On the synchronization between Hugging Face pre-trained language models and their upstream GitHub repository

Pre-trained language models (PTLMs) have transformed natural language processing (NLP), enabling major advances in tasks such as text generation and translation. Similar to software package management, PTLMs are developed using code and…

Software Engineering · Computer Science · 2026-01-27 · Adekunle Ajibode, Abdul Ali Bangash, Oussama Ben Sghaier, Bram Adams, Ahmed E. Hassan

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. A common crowdsourcing practice is to recruit a small number of high-quality workers, and have them massively generate…

Computation and Language · Computer Science · 2019-08-29 · Mor Geva, Yoav Goldberg, Jonathan Berant

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic…

Machine Learning · Computer Science · 2020-10-27 · Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, Jens Lehmann

Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few…

Machine Learning · Computer Science · 2024-06-18 · Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin

Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph

The rapid growth of open source machine learning (ML) resources, such as models and datasets, has accelerated IR research. However, existing platforms like Hugging Face do not explicitly utilize structured representations, limiting advanced…

Information Retrieval · Computer Science · 2025-05-26 · Qiaosheng Chen, Kaijia Huang, Xiao Zhou, Weiqing Luo, Yuanning Cui, Gong Cheng

Efficient Long Sequence Encoding via Synchronization

Pre-trained Transformer models have achieved successes in a wide range of NLP tasks, but are inefficient when dealing with long input sequences. Existing studies try to overcome this challenge via segmenting the long sequence followed by…

Computation and Language · Computer Science · 2022-03-16 · Xiangyang Mou, Mo Yu, Bingsheng Yao, Lifu Huang

CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation

Human annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention.…

Human-Computer Interaction · Computer Science · 2022-06-22 · Mark Diaz, Ian D. Kivlichan, Rachel Rosen, Dylan K. Baker, Razvan Amironesei, Vinodkumar Prabhakaran, Emily Denton

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

While generalization over tasks from easy to hard is crucial to profile large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address…

Machine Learning · Computer Science · 2025-06-10 · Mucong Ding, Chenghao Deng, Jocelyn Choo, Zichu Wu, Aakriti Agrawal, Avi Schwarzschild, Tianyi Zhou, Tom Goldstein, John Langford, Anima Anandkumar, Furong Huang

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification,…

Computation and Language · Computer Science · 2025-09-18 · Rajvee Sheth, Himanshu Beniwal, Mayank Singh

CrowdAgent: Multi-Agent Managed Multi-Source Annotation System

High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources - including Large Language Models (LLMs), Small Language Models (SLMs), and human…

Artificial Intelligence · Computer Science · 2025-09-18 · Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang

Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models

To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we proposed a novel framework that leverages a neural network to predict the best dataset combinations. The…

Computation and Language · Computer Science · 2025-05-06 · Zaifu Zhan, Rui Zhang

TACO: Topics in Algorithmic COde generation dataset

We introduce TACO, an open-source, large-scale code generation dataset, with a focus on the topics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark in the field of code generation models. TACO…

Artificial Intelligence · Computer Science · 2023-12-29 · Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, Ge Li

Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or…

Computation and Language · Computer Science · 2022-10-27 · Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack