Related papers: Toxicity Classification in Ukrainian

200 papers

Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. To support further fair development of NLP models, exploring the…

Computation and Language · Computer Science 2025-02-06 Daryna Dementieva, Valeriia Khylenko, Georg Groh

In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison…

Computation and Language · Computer Science 2025-10-23 Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser

We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic…

Computation and Language · Computer Science 2016-04-20 Andrey Kutuzov, Mikhail Kopotev, Tatyana Sviridenko, Lyubov Ivanova

Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a…

Computation and Language · Computer Science 2025-09-19 Samuel J. Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

Toxicity classification for voice heavily relies on the semantic content of speech. We propose a novel framework that utilizes cross-modal learning to integrate the semantic embedding of text into a multilabel speech toxicity classifier…

Computation and Language · Computer Science 2024-11-19 Joseph Liu, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire

Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely…

Computation and Language · Computer Science 2025-10-20 Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu, Changwei Wang, Lei Feng, Yuheng Jia, Gang Niu, Masashi Sugiyama, Xin Geng

While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for…

Computation and Language · Computer Science 2025-09-29 Daryna Dementieva, Nikolay Babakov, Alexander Fraser

In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect the toxicity in the generated text. The majority of existing toxicity metrics rely on encoder models trained on specific…

Computation and Language · Computer Science 2024-11-15 Hyukhun Koh, Dohyung Kim, Minwoo Lee, Kyomin Jung

Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language, which can have detrimental effects on individuals and communities. Although most effort goes into assessing and mitigating…

Computation and Language · Computer Science 2024-06-26 Caroline Brun, Vassilina Nikoulina

Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, which adapts…

Computation and Language · Computer Science 2018-03-29 Ruochen Xu, Yiming Yang

Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in the current field is Linguistic Inquiry and Word Count (LIWC) that originally was…

Computation and Language · Computer Science 2026-01-29 Elina Sigdel, Anastasia Panfilova

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a…

Computation and Language · Computer Science 2025-10-24 Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale:…

Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/few-shot cross-lingual transfer). Nowadays,…

Computation and Language · Computer Science 2023-06-09 Inigo Jauregi Unanue, Gholamreza Haffari, Massimo Piccardi

Cross-lingual document classification aims at training a document classifier on resources in one language and transferring it to a different language without any additional resources. Several approaches have been proposed in the literature…

Computation and Language · Computer Science 2018-05-28 Holger Schwenk, Xian Li

Due to the subtleness, implicity, and different possible interpretations perceived by different people, detecting undesirable content from text is a nuanced difficulty. It is a long-known risk that language models (LMs), once trained on…

Computation and Language · Computer Science 2022-05-26 Yau-Shian Wang, Yingshan Chang

This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a…

Computation and Language · Computer Science 2024-04-09 Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly…

Computation and Language · Computer Science 2024-10-18 Xinmeng Hou

In order to create a corpus exploration method providing topics that are easier to interpret than standard LDA topic models, here we propose combining two techniques called Entity linking and Labeled LDA. Our method identifies in an…

Computation and Language · Computer Science 2016-04-27 Federico Nanni, Pablo Ruiz Fabo

Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these…

Computation and Language · Computer Science 2026-01-22 Chaymaa Abbas, Nour Shamaa, Mariette Awad