Related papers: Toxicity Classification in Ukrainian

Bridging the domain gap in cross-lingual document classification

The scarcity of labeled training data often prohibits the internationalization of NLP models to multiple languages. Recent developments in cross-lingual understanding (XLU) have made progress in this area, trying to bridge the language…

Computation and Language · Computer Science 2019-09-23 Guokun Lai, Barlas Oguz, Yiming Yang, Veselin Stoyanov

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is…

Computation and Language · Computer Science 2026-05-19 Franciszek Górski, Andrzej Czyżewski

Detecting Text Formality: A Study of Text Classification Approaches

Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were…

Computation and Language · Computer Science 2023-09-11 Daryna Dementieva, Nikolay Babakov, Alexander Panchenko

Cross Script Hindi English NER Corpus from Wikipedia

The text generated on social media platforms is essentially a mixed lingual text. The mixing of language in any form produces a considerable amount of difficulty in language processing systems. Moreover, the advancements in language…

Information Retrieval · Computer Science 2018-10-09 Mohd Zeeshan Ansari, Tanvir Ahmad, Md Arshad Ali

A Multi-Task Text Classification Pipeline with Natural Language Explanations: A User-Centric Evaluation in Sentiment Analysis and Offensive Language Identification in Greek Tweets

Interpretability is a topic that has been in the spotlight for the past few years. Most existing interpretability techniques produce interpretations in the form of rules or feature importance. These interpretations, while informative, may…

Computation and Language · Computer Science 2024-10-15 Nikolaos Mylonas, Nikolaos Stylianou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris

Current Landscape of the Russian Sentiment Corpora

Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available…

Computation and Language · Computer Science 2021-06-29 Evgeny Kotelnikov

Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer

Expert-layman text style transfer technologies have the potential to improve communication between members of scientific communities and the general public. High-quality information produced by experts is often filled with difficult jargon…

Computation and Language · Computer Science 2021-12-21 Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang

Multilingual Clinical NER: Translation or Cross-lingual Transfer?

Natural language tasks like Named Entity Recognition (NER) in the clinical domain on non-English texts can be very time-consuming and expensive due to the lack of annotated data. Cross-lingual transfer (CLT) is a way to circumvent this…

Computation and Language · Computer Science 2023-06-08 Xavier Fontaine, Félix Gaschi, Parisa Rastin, Yannick Toussaint

Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics

This study reviewed the use of Large Language Models (LLMs) in healthcare, focusing on their training corpora, customization techniques, and evaluation metrics. A systematic search of studies from 2021 to 2024 identified 61 articles. Four…

Computation and Language · Computer Science 2025-02-18 Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu

Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge…

Computation and Language · Computer Science 2025-06-05 Florian Eichin, Yang Janet Liu, Barbara Plank, Michael A. Hedderich

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we…

Computation and Language · Computer Science 2025-09-24 Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee

A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement

Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of…

Computation and Language · Computer Science 2025-09-01 Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz

ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting

In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team: a robust multilingual text detoxification pipeline that integrates lexicon-guided tagging, a fine-tuned…

Computation and Language · Computer Science 2025-07-28 Nicole Lai-Lopez, Lusha Wang, Su Yuan, Liza Zhang

Automatic Labeling for Entity Extraction in Cyber Security

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is…

Information Retrieval · Computer Science 2014-06-11 Robert A. Bridges, Corinne L. Jones, Michael D. Iannacone, Kelly M. Testa, John R. Goodall

Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets

The use of machine learning (ML)-based language models (LMs) to monitor content online is on the rise. For toxic text identification, task-specific fine-tuning of these models is performed using datasets labeled by annotators who provide…

Computation and Language · Computer Science 2021-12-08 Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Moninder Singh

Towards an automatic recognition of mixed languages: The Ukrainian-Russian hybrid language Surzhyk

Language interference is common in today's multilingual societies, where more languages are in contact, and ultimately leads to the creation of hybrid languages. These, together with doubts on their right to be officially…

Computation and Language · Computer Science 2019-12-19 Nataliya Sira, Giorgio Maria Di Nunzio, Viviana Nosilia

Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings

Large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language. Consequently, language models encode toxic information, which makes the real-world usage of these…

Computation and Language · Computer Science 2021-12-16 Andrew Wang, Mohit Sudhakar, Yangfeng Ji

Systematic Rectification of Language Models via Dead-end Analysis

With adversarial or otherwise normal prompts, existing large language models (LLMs) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can…

Computation and Language · Computer Science 2023-02-28 Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian

The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents

Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison…

Computers and Society · Computer Science 2026-02-03 Oleg Smirnov

Learning Multilingual Topics from Incomparable Corpus

Multilingual topic models enable crosslingual tasks by extracting consistent topics from multilingual corpora. Most models require parallel or comparable training corpora, which limits their ability to generalize. In this paper, we first…

Computation and Language · Computer Science 2018-06-13 Shudong Hao, Michael J. Paul