Related papers: tasksource: A Dataset Harmonization Framework for …

200 papers

The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this…

Empirical natural language processing (NLP) systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion, human annotation, to text retrieval, analysis,…

Developing documentation guidelines and easy-to-use templates for datasets and models is a challenging task, especially given the variety of backgrounds, skills, and incentives of the people involved in the building of natural language…

In this paper, we introduce HugNLP, a unified and comprehensive library for natural language processing (NLP) with the prevalent backend of HuggingFace Transformers, which is designed for NLP researchers to easily utilize off-the-shelf…

Computation and Language · Computer Science 2023-03-01 Jianing Wang, Nuo Chen, Qiushi Sun, Wenkang Huang, Chengyu Wang, Ming Gao

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding…

Machine Learning · Computer Science 2024-01-26 Xinyu Yang, Weixin Liang, James Zou

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data - a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the…

Computation and Language · Computer Science 2022-11-08 Selim Fekih, Nicolò Tamagnone, Benjamin Minixhofer, Ranjan Shrestha, Ximena Contla, Ewan Oglethorpe, Navid Rekabsaz

We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional…

Machine Learning · Computer Science 2020-04-02 Luke Metz, Niru Maheswaranathan, Ruoxi Sun, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

Pre-trained language models (PTLMs) have transformed natural language processing (NLP), enabling major advances in tasks such as text generation and translation. Similar to software package management, PTLMs are developed using code and…

Software Engineering · Computer Science 2026-01-27 Adekunle Ajibode, Abdul Ali Bangash, Oussama Ben Sghaier, Bram Adams, Ahmed E. Hassan

Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. A common crowdsourcing practice is to recruit a small number of high-quality workers, and have them massively generate…

Computation and Language · Computer Science 2019-08-29 Mor Geva, Yoav Goldberg, Jonathan Berant

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic…

Machine Learning · Computer Science 2020-10-27 Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, Jens Lehmann

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few…

The rapid growth of open source machine learning (ML) resources, such as models and datasets, has accelerated IR research. However, existing platforms like Hugging Face do not explicitly utilize structured representations, limiting advanced…

Information Retrieval · Computer Science 2025-05-26 Qiaosheng Chen, Kaijia Huang, Xiao Zhou, Weiqing Luo, Yuanning Cui, Gong Cheng

Pre-trained Transformer models have achieved successes in a wide range of NLP tasks, but are inefficient when dealing with long input sequences. Existing studies try to overcome this challenge via segmenting the long sequence followed by…

Computation and Language · Computer Science 2022-03-16 Xiangyang Mou, Mo Yu, Bingsheng Yao, Lifu Huang

Human annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention.…

Human-Computer Interaction · Computer Science 2022-06-22 Mark Diaz, Ian D. Kivlichan, Rachel Rosen, Dylan K. Baker, Razvan Amironesei, Vinodkumar Prabhakaran, Emily Denton

While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank. Aiming to address…

We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification,…

Computation and Language · Computer Science 2025-09-18 Rajvee Sheth, Himanshu Beniwal, Mayank Singh

High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources-including Large Language Models (LLMs), Small Language Models (SLMs), and human…

Artificial Intelligence · Computer Science 2025-09-18 Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang

To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we proposed a novel framework that leverages a neural network to predict the best dataset combinations. The…

Computation and Language · Computer Science 2025-05-06 Zaifu Zhan, Rui Zhang

We introduce TACO, an open-source, large-scale code generation dataset, with a focus on the optics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark in the field of code generation models. TACO…

Artificial Intelligence · Computer Science 2023-12-29 Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, Ge Li

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or…

Computation and Language · Computer Science 2022-10-27 Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack