Related papers: Data Programming: Creating Large Training Sets, Qu…

Data Programming using Continuous and Quality-Guided Labeling Functions

Scarcity of labeled data is a bottleneck for supervised learning models. A paradigm that has evolved for dealing with this problem is data programming. An existing data programming paradigm allows human supervision to be provided as a set…

Machine Learning · Computer Science 2019-11-25 Oishik Chatterjee, Ganesh Ramakrishnan, Sunita Sarawagi

The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming

Most advanced supervised Machine Learning (ML) models rely on vast amounts of point-by-point labelled training examples. Hand-labelling vast amounts of data may be tedious, expensive, and error-prone. Recently, some studies have explored…

Machine Learning · Computer Science 2021-08-27 Chufan Gao, Mononito Goswami

Making Large Language Models Better Data Creators

Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As…

Computation and Language · Computer Science 2023-11-01 Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

Semi-Supervised Data Programming with Subset Selection

The paradigm of data programming, which uses weak supervision in the form of rules/labelling functions, and semi-supervised learning, which augments small amounts of labelled data with a large unlabelled dataset, have shown great promise in…

Machine Learning · Computer Science 2021-06-15 Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Ganesh Ramakrishnan, Rishabh Iyer

Iterative Data Programming for Expanding Text Classification Corpora

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data…

Machine Learning · Computer Science 2020-02-05 Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta

A Data Management Approach for Dataset Selection Using Human Computation

As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly…

Machine Learning · Computer Science 2013-07-16 Alexandros Ntoulas, Omar Alonso, Vasilis Kandylas

An Empirical Study on Noisy Label Learning for Program Understanding

Recently, deep learning models have been widely applied in program understanding tasks, and these models achieve state-of-the-art results on many benchmark datasets. A major challenge of deep learning for program understanding is that the…

Software Engineering · Computer Science 2024-01-02 Wenhan Wang, Yanzhou Li, Anran Li, Jian Zhang, Wei Ma, Yang Liu

Automating Weak Label Generation for Data Programming with Clinicians in the Loop

Large Deep Neural Networks (DNNs) are often data hungry and need high-quality labeled data in copious amounts for learning to converge. This is a challenge in the field of medicine since high quality labeled data is often scarce. Data…

Machine Learning · Computer Science 2024-07-12 Jean Park, Sydney Pugh, Kaustubh Sridhar, Mengyu Liu, Navish Yarna, Ramneet Kaur, Souradeep Dutta, Elena Bernardis, Oleg Sokolsky, Insup Lee

Cross-Modal Data Programming Enables Rapid Medical Machine Learning

Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports…

Machine Learning · Computer Science 2019-03-28 Jared Dunnmon, Alexander Ratner, Nishith Khandwala, Khaled Saab, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew Lungren, Daniel Rubin, Christopher Ré

Label Noise Types and Their Effects on Deep Learning

The recent success of deep learning is mostly due to the availability of big datasets with clean annotations. However, gathering a cleanly annotated dataset is not always feasible due to practical challenges. As a result, label noise is a…

Computer Vision and Pattern Recognition · Computer Science 2020-03-25 Görkem Algan, İlkay Ulusoy

Learning to Learn from Noisy Labeled Data

Despite the success of deep neural networks (DNNs) in image classification tasks, the human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There…

Machine Learning · Computer Science 2019-04-15 Junnan Li, Yongkang Wong, Qi Zhao, Mohan Kankanhalli

ActiveDP: Bridging Active Learning and Data Programming

Modern machine learning models require large labelled datasets to achieve good performance, but manually labelling large datasets is expensive and time-consuming. The data programming paradigm enables users to label large datasets…

Machine Learning · Computer Science 2024-02-12 Naiqing Guan, Nick Koudas

Distilling Effective Supervision from Severe Label Noise

Collecting large-scale data with clean labels for supervised training of neural networks is practically challenging. Although noisy labels are usually cheap to acquire, existing methods suffer a lot from label noise. This paper targets at…

Machine Learning · Computer Science 2020-06-16 Zizhao Zhang, Han Zhang, Sercan O. Arik, Honglak Lee, Tomas Pfister

Iterative Label Improvement: Robust Training by Confidence Based Filtering and Dataset Partitioning

State-of-the-art, high capacity deep neural networks not only require large amounts of labelled training data, they are also highly susceptible to label errors in this data, typically resulting in large efforts and costs and therefore…

Machine Learning · Computer Science 2020-07-20 Christian Haase-Schütz, Rainer Stal, Heinz Hertlein, Bernhard Sick

Data Consistency for Weakly Supervised Learning

In many applications, training machine learning models involves using large amounts of human-annotated data. Obtaining precise labels for the data is expensive. Instead, training with weak supervision provides a low-cost alternative. We…

Machine Learning · Computer Science 2022-02-09 Chidubem Arachie, Bert Huang

Pairwise Feedback for Data Programming

The scalability of the labeling process and the attainable quality of labels have become limiting factors for many applications of machine learning. The programmatic creation of labeled datasets via the synthesis of noisy heuristics…

Machine Learning · Computer Science 2019-12-18 Benedikt Boecking, Artur Dubrawski

Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks

Collecting large training datasets, annotated with high-quality labels, is costly and time-consuming. This paper proposes a novel framework for training deep convolutional neural networks from noisy labeled datasets that can be obtained…

Machine Learning · Computer Science 2017-11-06 Arash Vahdat

Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming

A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to…

Machine Learning · Computer Science 2022-03-11 Ayush Maheshwari, Krishnateja Killamsetty, Ganesh Ramakrishnan, Rishabh Iyer, Marina Danilevsky, Lucian Popa

The Re-Label Method For Data-Centric Machine Learning

In industry deep learning application, our manually labeled data has a certain number of noisy data. To solve this problem and achieve more than 90 score in dev dataset, we present a simple method to find the noisy data and re-label the…

Machine Learning · Computer Science 2025-03-20 Tong Guo

Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features

While mislabeled or ambiguously-labeled samples in the training set could negatively affect the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve the generalization power. Training…

Computer Vision and Pattern Recognition · Computer Science 2022-12-21 Qingrui Jia, Xuhong Li, Lei Yu, Jiang Bian, Penghao Zhao, Shupeng Li, Haoyi Xiong, Dejing Dou