Related papers: COMMENTATOR: A Code-mixed Multilingual Text Annota…

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification,…

Computation and Language · Computer Science 2025-09-18 Rajvee Sheth, Himanshu Beniwal, Mayank Singh

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of…

Computation and Language · Computer Science 2024-08-12 Nam Le Hai, Nghi D. Q. Bui

The DALPHI annotation framework & how its pre-annotations can improve annotator efficiency

Producing the required amounts of training data for machine learning and NLP tasks often involves human annotators doing very repetitive and monotonous work. In this paper, we present and evaluate our novel annotation framework DALPHI,…

Information Retrieval · Computer Science 2018-08-20 Robert Greinacher, Franziska Horn

ARTICLE: Annotator Reliability Through In-Context Learning

Ensuring annotator quality in training and evaluation data is a key piece of machine learning in NLP. Tasks such as sentiment analysis and offensive speech detection are intrinsically subjective, creating a challenging scenario for…

Computation and Language · Computer Science 2024-09-23 Sujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh

CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing

The NLP community has witnessed steep progress in a variety of tasks across the realms of monolingual and multilingual language processing recently. These successes, in conjunction with the proliferating mixed language interactions on…

Computation and Language · Computer Science 2021-06-14 Sai Muralidhar Jayanthi, Kavya Nerella, Khyathi Raghavi Chandu, Alan W Black

Annotator in the Loop: A Case Study of In-Depth Rater Engagement to Create a Bridging Benchmark Dataset

With the growing prevalence of large language models, it is increasingly common to annotate datasets for machine learning using pools of crowd raters. However, these raters often work in isolation as individual crowdworkers. In this work,…

Computers and Society · Computer Science 2024-08-05 Sonja Schmer-Galunder, Ruta Wheelock, Scott Friedman, Alyssa Chvasta, Zaria Jalan, Emily Saltz

Revisiting the Role of Natural Language Code Comments in Code Translation

The advent of large language models (LLMs) has ushered in a new era in automated code translation across programming languages. Since most code-specific LLMs are pretrained on well-commented code from large repositories like GitHub, it is…

Software Engineering · Computer Science 2026-01-26 Monika Gupta, Ajay Meena, Anamitra Roy Choudhury, Vijay Arya, Srikanta Bedathur

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs

Recent studies emphasize the need of document context in human evaluation of machine translations, but little research has been done on the impact of user interfaces on annotator productivity and the reliability of assessments. In this…

Computation and Language · Computer Science 2021-04-22 Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann, Tom Kocmi

CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling

The mixing of two or more languages is called Code-Mixing (CM). CM is a social norm in multilingual societies. Neural Language Models (NLMs) like transformers have been effective on many NLP tasks. However, NLM for CM is an under-explored…

Computation and Language · Computer Science 2023-10-20 Mohsin Ali, Kandukuri Sai Teja, Neeharika Gupta, Parth Patwa, Anubhab Chatterjee, Vinija Jain, Aman Chadha, Amitava Das

Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation

In the context of text classification, the financial burden of annotation exercises for creating training data is a critical issue. Active learning techniques, particularly those rooted in uncertainty sampling, offer a cost-effective…

Computation and Language · Computer Science 2024-06-19 Hamidreza Rouzegar, Masoud Makrehchi

Code-Mixed to Monolingual Translation Framework

The use of multilingualism in the new generation is widespread in the form of code-mixed data on social media, and therefore a robust translation system is required for catering to the monolingual users, as well as for easier comprehension…

Computation and Language · Computer Science 2019-11-25 Sainik Kumar Mahata, Soumil Mandal, Dipankar Das, Sivaji Bandyopadhyay

CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies

Various NLP tasks require a complex hierarchical structure over nodes, where each node is a cluster of items. Examples include generating entailment graphs, hierarchical cross-document coreference resolution, annotating event and subevent…

Computation and Language · Computer Science 2023-11-21 Arie Cattan, Tom Hope, Doug Downey, Roy Bar-Haim, Lilach Eden, Yoav Kantor, Ido Dagan

Challenges and Considerations with Code-Mixed NLP for Multilingual Societies

Multilingualism refers to the high degree of proficiency in two or more languages in the written and oral communication modes. It often results in language mixing, a.k.a. code-mixing, when a multilingual speaker switches between multiple…

Computation and Language · Computer Science 2021-06-16 Vivek Srivastava, Mayank Singh

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response.…

Computation and Language · Computer Science 2025-07-24 Arduin Findeis, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative…

Computation and Language · Computer Science 2026-05-26 Lingyu Gao, Will Monroe, David Smith, Meghan Jemison, Jackie Lee

LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation

Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains…

Computation and Language · Computer Science 2026-02-06 Bingru Li

Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation

Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extensive work. This paper…

Computer Vision and Pattern Recognition · Computer Science 2025-03-17 He Zhang, Xinyi Fu, John M. Carroll

Unveiling the Multi-Annotation Process: Examining the Influence of Annotation Quantity and Instance Difficulty on Model Performance

The NLP community has long advocated for the construction of multi-annotator datasets to better capture the nuances of language interpretation, subjectivity, and ambiguity. This paper conducts a retrospective study to show how performance…

Computation and Language · Computer Science 2023-10-24 Pritam Kadasi, Mayank Singh

Are Expert-Level Language Models Expert-Level Annotators?

Data annotation refers to the labeling or tagging of textual data with relevant information. A large body of works have reported positive results on leveraging LLMs as an alternative to human annotators. However, existing studies focus on…

Computation and Language · Computer Science 2024-10-07 Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

We describe models focused at the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed…

Computation and Language · Computer Science 2021-05-20 Ganesh Jawahar, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan