Related papers: Annotations Mitigate Post-Training Mode Collapse

Auto-Annotation Quality Prediction for Semi-Supervised Learning with Ensembles

Auto-annotation by ensemble of models is an efficient method of learning on unlabeled data. Wrong or inaccurate annotations generated by the ensemble may lead to performance degradation of the trained model. To deal with this problem we…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Dror Simon, Miriam Farber, Roman Goldenberg

Unveiling the Multi-Annotation Process: Examining the Influence of Annotation Quantity and Instance Difficulty on Model Performance

The NLP community has long advocated for the construction of multi-annotator datasets to better capture the nuances of language interpretation, subjectivity, and ambiguity. This paper conducts a retrospective study to show how performance…

Computation and Language · Computer Science 2023-10-24 Pritam Kadasi, Mayank Singh

On Incorporating Semantic Prior Knowledge in Deep Learning Through Embedding-Space Constraints

The knowledge that humans hold about a problem often extends far beyond a set of training data and output labels. While the success of deep learning mostly relies on supervised training, important properties cannot be inferred efficiently…

Computer Vision and Pattern Recognition · Computer Science 2019-11-19 Damien Teney, Ehsan Abbasnejad, Anton van den Hengel

Pre-Trained Vision-Language Models as Partial Annotators

Pre-trained vision-language models learn massive data to model unified representations of images and natural languages, which can be widely applied to downstream machine learning tasks. In addition to zero-shot inference, in order to better…

Computer Vision and Pattern Recognition · Computer Science 2024-06-28 Qian-Wei Wang, Yuqiu Xie, Letian Zhang, Zimo Liu, Shu-Tao Xia

Learning Fast Matching Models from Weak Annotations

This paper proposes a novel training scheme for fast matching models in Search Ads, which is motivated by the real challenges in model training. The first challenge stems from the pursuit of high throughput, which prohibits the deployment…

Information Retrieval · Computer Science 2019-04-23 Xue Li, Zhipeng Luo, Hao Sun, Jianjin Zhang, Weihao Han, Xianqi Chu, Liangjie Zhang, Qi Zhang

UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations

This work describes a self-supervised data augmentation approach used to improve learning models' performances when only a moderate amount of labeled data is available. Multiple copies of the original model are initially trained on the…

Computation and Language · Computer Science 2020-12-18 Gabriele Sarti

Multi-dataset Pretraining: A Unified Model for Semantic Segmentation

Collecting annotated data for semantic segmentation is time-consuming and hard to scale up. In this paper, we for the first time propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented…

Computer Vision and Pattern Recognition · Computer Science 2021-06-09 Bowen Shi, Xiaopeng Zhang, Haohang Xu, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often…

Computation and Language · Computer Science 2026-02-10 Yijie Chen, Yijin Liu, Fandong Meng

Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation -- a process that is…

Computation and Language · Computer Science 2025-07-30 Abhinav Arabelly, Jagrut Nemade, Robert D Nowak, Jifan Zhang

Learning from Imperfect Annotations

Many machine learning systems today are trained on large amounts of human-annotated data. Data annotation tasks that require a high level of competency make data acquisition expensive, while the resulting labels are often subjective,…

Machine Learning · Computer Science 2020-04-08 Emmanouil Antonios Platanios, Maruan Al-Shedivat, Eric Xing, Tom Mitchell

Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm

Given the large-scale data and the high annotation cost, pretraining-finetuning becomes a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and supervised finetuning in this…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Yichen Xie, Han Lu, Junchi Yan, Xiaokang Yang, Masayoshi Tomizuka, Wei Zhan

Optimizing Active Learning for Low Annotation Budgets

When we cannot assume a large amount of annotated data, active learning is a good strategy. It consists in learning a model on a small amount of annotated data (annotation budget) and in choosing the best set of points to annotate in…

Computer Vision and Pattern Recognition · Computer Science 2022-01-19 Umang Aggarwal, Adrian Popescu, Céline Hudelot

Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost

State-of-the-art supervised NLP models achieve high accuracy but are also susceptible to failures on inputs from low-data regimes, such as domains that are not represented in training data. As an approximation to collecting ground-truth…

Computation and Language · Computer Science 2023-06-29 Parikshit Bansal, Amit Sharma

Improving Span-based Question Answering Systems with Coarsely Labeled Data

We study approaches to improve fine-grained short answer Question Answering models by integrating coarse-grained data annotated for paragraph-level relevance and show that coarsely annotated data can bring significant performance gains.…

Computation and Language · Computer Science 2018-11-07 Hao Cheng, Ming-Wei Chang, Kenton Lee, Ankur Parikh, Michael Collins, Kristina Toutanova

Multi-utility Learning: Structured-output Learning with Multiple Annotation-specific Loss Functions

Structured-output learning is a challenging problem; particularly so because of the difficulty in obtaining large datasets of fully labelled instances for training. In this paper we try to overcome this difficulty by presenting a…

Computer Vision and Pattern Recognition · Computer Science 2014-06-24 Roman Shapovalov, Dmitry Vetrov, Anton Osokin, Pushmeet Kohli

Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks

Pre-trained language models have shown excellent results in few-shot learning scenarios using in-context learning. Although it is impressive, the size of language models can be prohibitive to make them usable in on-device applications, such…

Computation and Language · Computer Science 2022-04-27 Navid Rezaei, Marek Z. Reformat

Mapping Post-Training Forgetting in Language Models at Scale

Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an…

Machine Learning · Computer Science 2025-10-21 Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

A Post-Training Enhanced Optimization Approach for Small Language Models

This paper delves into the continuous post-training optimization methods for small language models, and proposes a continuous post-training alignment data construction method for small language models. The core of this method is based on…

Computation and Language · Computer Science 2024-12-24 Keke Zhai

Prototype-Anchored Learning for Learning with Imperfect Annotations

The success of deep neural networks greatly relies on the availability of large amounts of high-quality annotated data, which however are difficult or expensive to obtain. The resulting labels may be class imbalanced, noisy or human biased.…

Machine Learning · Computer Science 2022-06-24 Xiong Zhou, Xianming Liu, Deming Zhai, Junjun Jiang, Xin Gao, Xiangyang Ji

The Impact of Annotation Guidelines and Annotated Data on Extracting App Features from App Reviews

Annotation guidelines used to guide the annotation of training and evaluation datasets can have a considerable impact on the quality of machine learning models. In this study, we explore the effects of annotation guidelines on the quality…

Information Retrieval · Computer Science 2018-10-15 Faiz Ali Shah, Kairit Sirts, Dietmar Pfahl