Related papers: CodeLabeller: A Web-based Code Annotation Tool for…

OneLabeler: A Flexible System for Building Data Labeling Tools

Labeled datasets are essential for supervised machine learning. Various data labeling tools have been built to collect labels in different usage scenarios. However, developing labeling tools is time-consuming, costly, and…

Human-Computer Interaction · Computer Science 2022-03-29 Yu Zhang, Yun Wang, Haidong Zhang, Bin Zhu, Siming Chen, Dongmei Zhang

TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration

Despite rapid developments in the field of machine learning research, collecting high-quality labels for supervised learning remains a bottleneck for many applications. This difficulty is exacerbated by the fact that state-of-the-art models…

Computation and Language · Computer Science 2021-06-25 Dongjin Choi, Sara Evensen, Çağatay Demiralp, Estevam Hruschka

ReviewRanker: A Semi-Supervised Learning Based Approach for Code Review Quality Estimation

Code review is considered a key process in the software industry for minimizing bugs and improving code quality. Inspection of review process effectiveness and continuous improvement can boost development productivity. Such inspection is a…

Software Engineering · Computer Science 2023-07-11 Saifullah Mahbub, Md. Easin Arafat, Chowdhury Rafeed Rahman, Zannatul Ferdows, Masum Hasan

Associating Natural Language Comment and Source Code Entities

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the…

Computation and Language · Computer Science 2019-12-17 Sheena Panthaplackel, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li

Recommendations for Datasets for Source Code Summarization

Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation, e.g., the one-sentence Java method descriptions in JavaDocs. Code…

Computation and Language · Computer Science 2019-04-05 Alexander LeClair, Collin McMillan

CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators

Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example…

Machine Learning · Computer Science 2023-01-30 Hui Wen Goh, Ulyana Tkachenko, Jonas Mueller

LabelVizier: Interactive Validation and Relabeling for Technical Text Annotations

With the rapid accumulation of text data produced by data-driven techniques, the task of extracting "data annotations"--concise, high-quality data summaries from unstructured raw text--has become increasingly important. The recent advances…

Human-Computer Interaction · Computer Science 2023-04-03 Xiaoyu Zhang, Xiwei Xuan, Alden Dima, Thurston Sexton, Kwan-Liu Ma

A modelling language for the effective design of Java annotations

This paper describes a new modelling language for the effective design of Java annotations. Since their inclusion in the 5th edition of Java, annotations have grown from a useful tool for the addition of meta-data to play a central role in…

Programming Languages · Computer Science 2019-10-02 Irene Córdoba, Juan de Lara

SCALAR: A Part-of-speech Tagger for Identifiers

The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's…

Software Engineering · Computer Science 2025-04-25 Christian D. Newman, Brandon Scholten, Sophia Testa, Joshua A. C. Behler, Syreen Banabilah, Michael L. Collard, Michael J. Decker, Mohamed Wiem Mkaouer, Marcos Zampieri, Eman Abdullah AlOmar, Reem Alsuhaibani, Anthony Peruma, Jonathan I. Maletic

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used…

Software Engineering · Computer Science 2021-05-25 Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

ActiveLab: Active Learning with Re-Labeling by Multiple Annotators

In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming…

Machine Learning · Computer Science 2024-07-29 Hui Wen Goh, Jonas Mueller

Learning code summarization from a small and local dataset

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of…

Software Engineering · Computer Science 2022-06-03 Toufique Ahmed, Premkumar Devanbu

CoDesc: A Large Code-Description Parallel Dataset

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the…

Computation and Language · Computer Science 2021-06-01 Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, Rifat Shahriyar

Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning

Applying machine learning to domains like Earth Sciences is impeded by the lack of labeled data, despite a large corpus of raw data available in such domains. For instance, training a wildfire classifier on satellite imagery requires…

Computer Vision and Pattern Recognition · Computer Science 2023-01-02 Tarun Narayanan, Ajay Krishnan, Anirudh Koul, Siddha Ganju

CodeLens: An Interactive Tool for Visualizing Code Representations

Representing source code in a generic input format is crucial to automate software engineering tasks, e.g., applying machine learning algorithms to extract information. Visualizing code representations can further enable human experts to…

Software Engineering · Computer Science 2023-07-28 Yuejun Guo, Seifeddine Bettaieb, Qiang Hu, Yves Le Traon, Qiang Tang

CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design

In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This…

Systems and Control · Electrical Eng. & Systems 2024-08-16 Hasan Sinan Bank, Daniel R. Herber

Label Assistant: A Workflow for Assisted Data Annotation in Image Segmentation Tasks

Recent research in the field of computer vision strongly focuses on deep learning architectures to tackle image processing problems. Deep neural networks are often considered in complex image processing scenarios since traditional computer…

Computer Vision and Pattern Recognition · Computer Science 2021-11-30 Marcel P. Schilling, Luca Rettenberger, Friedrich Münke, Haijun Cui, Anna A. Popova, Pavel A. Levkin, Ralf Mikut, Markus Reischl

EduCoder: An Open-Source Annotation System for Education Transcript Data

We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of…

Computation and Language · Computer Science 2026-05-06 Saad Ashraf, James Malamut, Vishal Kumar, Guanzhong Pan, Hyunji Nam, Mei Tan, Lucía Langlois, Liliana Deonizio, Helen Higgins, Dorottya Demszky

LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision

Curating high-quality, domain-specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade-offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Debargha Ganguly, Sumit Kumar, Ishwar Balappanawar, Weicong Chen, Shashank Kambhatla, Srinivasan Iyengar, Shivkumar Kalyanaraman, Ponnurangam Kumaraguru, Vipin Chaudhary

CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis

Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific…

Machine Learning · Computer Science 2020-09-01 Ge Zhang, Mike A. Merrill, Yang Liu, Jeffrey Heer, Tim Althoff