Related papers: Predictions For Pre-training Language Models

Revisiting Self-Training for Few-Shot Learning of Language Model

As unlabeled data carry rich task-relevant information, they have proven useful for few-shot learning of language models. The question is how to effectively make use of such data. In this work, we revisit the self-training technique for…

Computation and Language · Computer Science 2021-10-05 Yiming Chen, Yan Zhang, Chen Zhang, Grandee Lee, Ran Cheng, Haizhou Li

Uncertainty-aware Self-training for Text Classification with Few Labels

The recent success of large-scale pre-trained language models crucially hinges on fine-tuning them on large amounts of labeled data for the downstream task, which are typically expensive to acquire. In this work, we study self-training as one of…

Computation and Language · Computer Science 2020-06-30 Subhabrata Mukherjee, Ahmed Hassan Awadallah

Self-training Improves Pre-training for Natural Language Understanding

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a…

Computation and Language · Computer Science 2020-10-06 Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau

Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models

For protein sequence datasets, unlabeled data has greatly outpaced labeled data due to the high cost of wet-lab characterization. Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield…

Machine Learning · Computer Science 2020-12-02 Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani

A Survey on Self-supervised Pre-training for Sequential Transfer Learning in Neural Networks

Deep neural networks are typically trained under a supervised learning framework where a model learns a single task using labeled data. Instead of relying solely on labeled data, practitioners can harness unlabeled or related data to…

Machine Learning · Computer Science 2020-07-03 Huanru Henry Mao

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning…

Computation and Language · Computer Science 2024-10-07 Christopher Schröder, Gerhard Heyer

PTUM: Pre-training User Model from Unlabeled User Behaviors via Self-supervision

User modeling is critical for many personalized web services. Many existing methods model users based on their behaviors and the labeled data of target tasks. However, these methods cannot exploit useful information in unlabeled user…

Information Retrieval · Computer Science 2020-10-06 Chuhan Wu, Fangzhao Wu, Tao Qi, Jianxun Lian, Yongfeng Huang, Xing Xie

Entailment as Robust Self-Learner

Entailment has been recognized as an important metric for evaluating natural language understanding (NLU) models, and recent studies have found that entailment pretraining benefits weakly supervised fine-tuning. In this work, we design a…

Computation and Language · Computer Science 2023-05-30 Jiaxin Ge, Hongyin Luo, Yoon Kim, James Glass

Enhancing Self-Training Methods

Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias" that…

Machine Learning · Computer Science 2023-01-19 Aswathnarayan Radhakrishnan, Jim Davis, Zachary Rabin, Benjamin Lewis, Matthew Scherreik, Roman Ilin

Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining

The goal of semi-supervised learning is to utilize the unlabeled, in-domain dataset U to improve models trained on the labeled dataset D. Under the context of large-scale language-model (LM) pretraining, how we can make the best use of U is…

Computation and Language · Computer Science 2020-11-20 Zijun Sun, Chun Fan, Xiaofei Sun, Yuxian Meng, Fei Wu, Jiwei Li

Revisiting Pretraining for Semi-Supervised Learning in the Low-Label Regime

Semi-supervised learning (SSL) addresses the lack of labeled data by exploiting large unlabeled data through pseudolabeling. However, in the extremely low-label regime, pseudo labels could be incorrect, a.k.a. the confirmation bias, and the…

Computer Vision and Pattern Recognition · Computer Science 2022-05-09 Xun Xu, Jingyi Liao, Lile Cai, Manh Cuong Nguyen, Kangkang Lu, Wanyue Zhang, Yasin Yazici, Chuan Sheng Foo

Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets

We propose a semi-supervised text classifier based on self-training using one positive and one negative property of neural networks. One of the weaknesses of self-training is the semantic drift problem, where noisy pseudo-labels accumulate…

Computation and Language · Computer Science 2024-01-02 Payam Karisani

Revisiting Self-Training for Neural Sequence Generation

Self-training is one of the earliest and simplest semi-supervised methods. The key idea is to augment the original labeled dataset with unlabeled data paired with the model's prediction (i.e. the pseudo-parallel data). While self-training…

Machine Learning · Computer Science 2020-10-20 Junxian He, Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato

Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often…

Computation and Language · Computer Science 2025-06-27 Zhengyan Shi

Don't Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner

Language models (LMs) trained on vast quantities of unlabelled data have greatly advanced the field of natural language processing (NLP). In this study, we re-visit the widely accepted notion in NLP that continued pre-training LMs on…

Computation and Language · Computer Science 2023-10-09 Zhengxiang Shi, Aldo Lipani

AcTune: Uncertainty-aware Active Self-Training for Semi-Supervised Active Learning with Pretrained Language Models

While pre-trained language model (PLM) fine-tuning has achieved strong performance in many NLP tasks, the fine-tuning stage can still be demanding in labeled data. Recent works have resorted to active fine-tuning to improve the label…

Computation and Language · Computer Science 2022-05-04 Yue Yu, Lingkai Kong, Jieyu Zhang, Rongzhi Zhang, Chao Zhang

Big Self-Supervised Models are Strong Semi-Supervised Learners

One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way,…

Machine Learning · Computer Science 2020-10-27 Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton

LST: Lexicon-Guided Self-Training for Few-Shot Text Classification

Self-training provides an effective means of using an extremely small amount of labeled data to create pseudo-labels for unlabeled data. Many state-of-the-art self-training approaches hinge on different regularization methods to prevent…

Computation and Language · Computer Science 2022-02-08 Hazel Kim, Jaeman Son, Yo-Sub Han

Unlock the Power of Unlabeled Data in Language Driving Model

Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive. To…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Chaoqun Wang, Jie Yang, Xiaobin Hong, Ruimao Zhang

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this…

Sound · Computer Science 2021-09-09 Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli