Related papers: Toxicity Classification in Ukrainian

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to…

Artificial Intelligence · Computer Science 2026-05-12 Regina Gugg, Selina Niederländer, Andreas Stöckl, Martin Flechl

Challenges for Toxic Comment Classification: An In-Depth Error Analysis

Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task's challenges, others still remain unsolved and directions for further research…

Computation and Language · Computer Science 2018-09-21 Betty van Aken, Julian Risch, Ralf Krestel, Alexander Löser

The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language

This paper provides an overview of StyloMetrix, a text mining tool developed initially for the Polish language, further extended for English, and recently for Ukrainian. StyloMetrix is built upon various metrics crafted manually by…

Computation and Language · Computer Science 2023-05-24 Daria Stetsenko, Inez Okulska

Defining, Understanding, and Detecting Online Toxicity: Challenges and Machine Learning Approaches

Online toxic content has grown into a pervasive phenomenon, intensifying during times of crisis, elections, and social unrest. A significant amount of research has been focused on detecting or analyzing toxic content using machine-learning…

Computation and Language · Computer Science 2025-09-19 Gautam Kishore Shahi, Tim A. Majchrzak

About the creation of a parallel bilingual corpora of web-publications

An algorithm for creating parallel text corpora was presented. The algorithm is based on the use of "key words" in text documents, and on the means of their automated translation. Key words were singled out by means of using Russian…

Computation and Language · Computer Science 2008-07-03 D. V. Lande, V. V. Zhygalo

Challenges in Detoxifying Language Models

Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to…

Computation and Language · Computer Science 2021-09-16 Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang

Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks

For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to…

Computation and Language · Computer Science 2025-10-20 Kirill Semenov, Rico Sennrich

Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely…

Computation and Language · Computer Science 2026-03-31 Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

The spread of toxic content online is an important problem that has adverse effects on user experience online and in our society at large. Motivated by the importance and impact of the problem, research focuses on developing solutions to…

Computation and Language · Computer Science 2023-08-11 Xinlei He, Savvas Zannettou, Yun Shen, Yang Zhang

Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification

Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for…

Computation and Language · Computer Science 2023-11-27 Daryna Dementieva, Daniil Moskovskiy, David Dale, Alexander Panchenko

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation

Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that…

Computation and Language · Computer Science 2025-06-03 Vera Neplenbroek, Arianna Bisazza, Raquel Fernández

Hidden Persuasion: Detecting Manipulative Narratives on Social Media During the 2022 Russian Invasion of Ukraine

This paper presents one of the top-performing solutions to the UNLP 2025 Shared Task on Detecting Manipulation in Social Media. The task focuses on detecting and classifying rhetorical and stylistic manipulation techniques used to influence…

Computation and Language · Computer Science 2025-06-02 Kateryna Akhynko, Oleksandr Kosovan, Mykola Trokhymovych

Beyond Toxic: Toxicity Detection Datasets are Not Enough for Brand Safety

The rapid growth in user generated content on social media has resulted in a significant rise in demand for automated content moderation. Various methods and frameworks have been proposed for the tasks of hate speech detection and toxic…

Computation and Language · Computer Science 2024-09-27 Elizaveta Korotkova, Isaac Chung

Ukrainian-to-English folktale corpus: Parallel corpus creation and augmentation for machine translation in low-resource languages

Folktales are linguistically very rich and culturally significant in understanding the source language. Historically, only human translation has been used for translating folklore. Therefore, the number of translated texts is very sparse,…

Computation and Language · Computer Science 2024-10-15 Olena Burda-Lassen

BIPOLAR: Polarization-based granular framework for LLM bias evaluation

Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant…

Computation and Language · Computer Science 2025-08-18 Martin Pavlíček, Tomáš Filip, Petr Sosík

RuCoLA: Russian Corpus of Linguistic Acceptability

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the…

Computation and Language · Computer Science 2023-10-04 Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin, Alena Pestova, Ivan Smurov, Ekaterina Artemova

Czech Text Document Corpus v 2.0

This paper introduces "Czech Text Document Corpus v 2.0", a collection of text documents for automatic document classification in Czech language. It is composed of the text documents provided by the Czech News Agency and is freely available…

Computation and Language · Computer Science 2018-02-01 Pavel Král, Ladislav Lenc

Data Contamination Can Cross Language Barriers

The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text…

Computation and Language · Computer Science 2024-10-31 Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

With the ongoing growth in number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models…

Computation and Language · Computer Science 2021-01-11 Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties

There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal…

Computation and Language · Computer Science 2024-11-19 Fahim Faisal, Md Mushfiqur Rahman, Antonios Anastasopoulos