Related papers: Toxicity Classification in Ukrainian

Cross-lingual Text Classification Transfer: The Case of Ukrainian

Despite the large number of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. To support further fair development of NLP models, exploring the…

Computation and Language · Computer Science 2025-02-06 Daryna Dementieva, Valeriia Khylenko, Georg Groh

CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison…

Computation and Language · Computer Science 2025-10-23 Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser

Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints

We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic…

Computation and Language · Computer Science 2016-04-20 Andrey Kutuzov, Mikhail Kopotev, Tatyana Sviridenko, Lyubov Ivanova

Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification

Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a…

Computation and Language · Computer Science 2025-09-19 Samuel J. Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

Enhancing Multilingual Voice Toxicity Detection with Speech-Text Alignment

Toxicity classification for voice heavily relies on the semantic content of speech. We propose a novel framework that utilizes cross-modal learning to integrate the semantic embedding of text into a multilabel speech toxicity classifier…

Computation and Language · Computer Science 2024-11-19 Joseph Liu, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire

Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely…

Computation and Language · Computer Science 2025-10-20 Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu, Changwei Wang, Lei Feng, Yuheng Jia, Gang Niu, Masashi Sugiyama, Xin Geng

EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian

While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for…

Computation and Language · Computer Science 2025-09-29 Daryna Dementieva, Nikolay Babakov, Alexander Fraser

Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric

In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect the toxicity in the generated text. The majority of existing toxicity metrics rely on encoder models trained on specific…

Computation and Language · Computer Science 2024-11-15 Hyukhun Koh, Dohyung Kim, Minwoo Lee, Kyomin Jung

FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts

Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language, which can have detrimental effects on individuals and communities. Although most efforts are put to assess and mitigate…

Computation and Language · Computer Science 2024-06-26 Caroline Brun, Vassilina Nikoulina

Cross-lingual Distillation for Text Classification

Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, which adapts…

Computation and Language · Computer Science 2018-03-29 Ruochen Xu, Yiming Yang

RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in the current field is Linguistic Inquiry and Word Count (LIWC) that originally was…

Computation and Language · Computer Science 2026-01-29 Elina Sigdel, Anastasia Panfilova

Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a…

Computation and Language · Computer Science 2025-10-24 Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen

RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale:…

Computation and Language · Computer Science 2025-05-05 Adrian de Wynter, Ishaan Watts, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Nektar Ege Altıntoprak, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Gören, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification

Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/few-shot cross-lingual transfer). Nowadays,…

Computation and Language · Computer Science 2023-06-09 Inigo Jauregi Unanue, Gholamreza Haffari, Massimo Piccardi

A Corpus for Multilingual Document Classification in Eight Languages

Cross-lingual document classification aims at training a document classifier on resources in one language and transferring it to a different language without any additional resources. Several approaches have been proposed in the literature…

Computation and Language · Computer Science 2018-05-28 Holger Schwenk, Xian Li

Toxicity Detection with Generative Prompt-based Inference

Due to the subtlety, implicitness, and different possible interpretations perceived by different people, detecting undesirable content in text is a nuanced and difficult task. It is a long-known risk that language models (LMs), once trained on…

Computation and Language · Computer Science 2022-05-26 Yau-Shian Wang, Yingshan Chang

Cross-lingual Named Entity Corpus for Slavic Languages

This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a…

Computation and Language · Computer Science 2024-04-09 Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language

This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly…

Computation and Language · Computer Science 2024-10-18 Xinmeng Hou

Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA

In order to create a corpus exploration method providing topics that are easier to interpret than standard LDA topic models, here we propose combining two techniques called Entity linking and Labeled LDA. Our method identifies in an…

Computation and Language · Computer Science 2016-04-27 Federico Nanni, Pablo Ruiz Fabo

Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora

Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these…

Computation and Language · Computer Science 2026-01-22 Chaymaa Abbas, Nour Shamaa, Mariette Awad