Related papers: Label-Efficient Self-Training for Attribute Extrac…
Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling to train a model for supervised tasks such as text classification. While active learning…
Self-training provides an effective means of using an extremely small amount of labeled data to create pseudo-labels for unlabeled data. Many state-of-the-art self-training approaches hinge on different regularization methods to prevent…
A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of instances of the target…
Recently many research efforts have been devoted to image annotation by leveraging on the associated tags/keywords of web images as training labels. A key issue to resolve is the relatively low accuracy of the tags. In this paper, we…
The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high…
Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have…
Recent success of large-scale pre-trained language models crucially hinge on fine-tuning them on large amounts of labeled data for the downstream task, that are typically expensive to acquire. In this work, we study self-training as one of…
Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting…
Self-training via pseudo labeling is a conventional, simple, and popular pipeline to leverage unlabeled data. In this work, we first construct a strong baseline of self-training (namely ST) for semi-supervised semantic segmentation via…
To alleviate human efforts from obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the…
Given an unlabeled dataset and an annotation budget, we study how to selectively label a fixed number of instances so that semi-supervised learning (SSL) on such a partially labeled dataset is most effective. We focus on selecting the right…
Reducing the amount of labels required to train convolutional neural networks without performance degradation is key to effectively reduce human annotation efforts. We propose Reliable Label Bootstrapping (ReLaB), an unsupervised…
Semi-supervised learning (SSL) can reduce the need for large labelled datasets by incorporating unlabelled data into the training. This is particularly interesting for semantic segmentation, where labelling data is very costly and…
State-of-the-art deep neural networks require large-scale labeled training data that is often expensive to obtain or not available for many tasks. Weak supervision in the form of domain-specific rules has been shown to be useful in such…
Sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems and semantic parsing. Large-scale pre-trained language models…
Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification,…
Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging.…
Few-shot classification (FSC) is challenging due to the scarcity of labeled training data (e.g. only one labeled data point per class). Meta-learning has shown to achieve promising results by learning to initialize a classification model…
Dataset pruning reduces the storage and training costs of deep learning by selecting an informative subset from a large dataset. However, most existing pruning methods require fully labeled data, which limits their applicability in…
We propose a semi-supervised text classifier based on self-training using one positive and one negative property of neural networks. One of the weaknesses of self-training is the semantic drift problem, where noisy pseudo-labels accumulate…