Related papers: The MERIT Dataset: Modelling and Efficiently Rende…

MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-04 · Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-21 · Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic…

Machine Learning · Computer Science · 2020-10-27 · Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, Jens Lehmann

Position: Measure Dataset Diversity, Don't Just Claim It

Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets.…

Machine Learning · Computer Science · 2024-07-12 · Dora Zhao, Jerone T. A. Andrews, Orestis Papakyriakopoulos, Alice Xiang

DocumentNet: Bridging the Data Gap in Document Pre-Training

Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been…

Computation and Language · Computer Science · 2023-10-27 · Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei

Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides

Lecture slide presentations, a sequence of pages that contain text and figures accompanied by speech, are constructed and presented carefully in order to optimally transfer knowledge to students. Previous studies in multimedia and…

Artificial Intelligence · Computer Science · 2022-08-18 · Dong Won Lee, Chaitanya Ahuja, Paul Pu Liang, Sanika Natu, Louis-Philippe Morency

Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training

Currently, data and model size dominate the narrative in the training of super-large, powerful models. However, there has been a lack of exploration on the effect of other attributes of the training dataset on model performance. We…

Machine Learning · Computer Science · 2025-01-22 · Kavita Selva, Satita Vittayaareekul, Brando Miranda

EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational…

Computation and Language · Computer Science · 2026-01-07 · Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Zhuangzhi Dong, Jingren Zhang, Yufan Deng, Xinyu Zou, Yang Gao, Heyan Huang

Towards Unified Music Emotion Recognition across Dimensional and Categorical Models

One of the most significant challenges in Music Emotion Recognition (MER) comes from the fact that emotion labels can be heterogeneous across datasets with regard to the emotion representation, including categorical (e.g., happy, sad)…

Sound · Computer Science · 2025-04-14 · Jaeyong Kang, Dorien Herremans

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets…

Artificial Intelligence · Computer Science · 2026-04-22 · Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level…

Computation and Language · Computer Science · 2020-05-21 · Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld

MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing

Knowledge Tracing (KT) models students' evolving knowledge states to predict future performance, serving as a foundation for personalized education. While traditional deep learning models achieve high accuracy, they often lack…

Computation and Language · Computer Science · 2026-03-25 · Runze Li, Kedi Chen, Guwei Feng, Mo Yu, Jun Wang, Wei Zhang

On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets

There is an emerging line of research on multimodal instruction tuning, and a line of benchmarks has been proposed for evaluating these models recently. Instead of evaluating the models directly, in this paper, we try to evaluate the…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-02 · Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan

A Data Fusion Framework for Multi-Domain Morality Learning

Language models can be trained to recognize the moral sentiment of text, creating new opportunities to study the role of morality in human life. As interest in language and morality has grown, several ground truth datasets with moral…

Computation and Language · Computer Science · 2023-04-06 · Siyi Guo, Negar Mokhberian, Kristina Lerman

MERIT: Multi-view evidential learning for reliable and interpretable liver fibrosis staging

Accurate staging of liver fibrosis from magnetic resonance imaging (MRI) is crucial in clinical practice. While conventional methods often focus on a specific sub-region, multi-view learning captures more information by analyzing multiple…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-04 · Yuanye Liu, Zheyao Gao, Nannan Shi, Fuping Wu, Yuxin Shi, Qingchao Chen, Xiahai Zhuang

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human…

Computation and Language · Computer Science · 2026-05-28 · Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

Perceptual Score: What Data Modalities Does Your Model Perceive?

Machine learning advances in the last decade have relied significantly on large-scale datasets that continue to grow in size. Increasingly, those datasets also contain different data modalities. However, large multi-modal datasets are hard…

Machine Learning · Computer Science · 2021-10-28 · Itai Gat, Idan Schwartz, Alexander Schwing

Imbalanced Multi-label Classification for Business-related Text with Moderately Large Label Spaces

In this study, we compared the performance of four different methods for multi-label text classification using a specific imbalanced business dataset. The four methods we evaluated were fine-tuned BERT, Binary Relevance, Classifier Chains,…

Information Retrieval · Computer Science · 2023-06-13 · Muhammad Arslan, Christophe Cruz

A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues

With the advent of technology and the use of the latest devices, voluminous data are produced. Of these data, 80% are unstructured and the remaining 20% are structured and semi-structured. The produced data are in heterogeneous format and…

Software Engineering · Computer Science · 2023-11-13 · Ganesh Kumar, Shuib Basri, Abdullahi Abubakar Imam, Abdullateef Oluwaqbemiga Balogun, Hussaini Mamman, Luiz Fernando Capretz

Datasets for Large Language Models: A Comprehensive Survey

This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that…

Computation and Language · Computer Science · 2024-02-29 · Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin