Related papers: Toxicity Classification in Ukrainian

Multilingual and Explainable Text Detoxification with Parallel Corpora

Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential…

Computation and Language · Computer Science 2024-12-17 Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Moskovskiy, Elisei Stakovskii, Eran Kaufman, Ashraf Elnagar, Animesh Mukherjee, Alexander Panchenko

Detecting Toxicity in News Articles: Application to Bulgarian

Online media aim for reaching ever bigger audience and for attracting ever longer attention span. This competition creates an environment that rewards sensational, fake, and toxic news. To help limit their spread and impact, we propose and…

Computation and Language · Computer Science 2019-08-27 Yoan Dinkov, Ivan Koychev, Preslav Nakov

Reducing Unintended Identity Bias in Russian Hate Speech Detection

Toxicity has become a grave problem for many online communities and has been growing across many languages, including Russian. Hate speech creates an environment of intimidation, discrimination, and may even incite some real-world violence.…

Computation and Language · Computer Science 2020-10-23 Nadezhda Zueva, Madina Kabirova, Pavel Kalaidin

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer…

Computation and Language · Computer Science 2024-08-09 François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

Legal document retrieval across languages: topic hierarchies based on synsets

Cross-lingual annotations of legislative texts enable us to explore major themes covered in multilingual legal data and are a key facilitator of semantic similarity when searching for similar documents. Multilingual probabilistic topic…

Information Retrieval · Computer Science 2019-12-02 Carlos Badenes-Olmedo, Jose-Luis Redondo-Garcia, Oscar Corcho

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification

Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each…

Machine Learning · Computer Science 2021-09-22 Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani

Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing tasks requiring bilingual data. This research proposes a language independent bi-sentence…

Computation and Language · Computer Science 2015-10-16 Krzysztof Wołk

Termhood-based Comparability Metrics of Comparable Corpus in Special Domain

Cross-Language Information Retrieval (CLIR) and machine translation (MT) resources, such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Besides, these resources are just limited to a few languages,…

Computation and Language · Computer Science 2013-02-20 Sa Liu, Chengzhi Zhang

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which…

Computation and Language · Computer Science 2026-05-26 Shaz Furniturewala, Arkaitz Zubiaga

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap,…

Computation and Language · Computer Science 2024-05-31 Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis

KréyoLID From Language Identification Towards Language Mining

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For…

Computation and Language · Computer Science 2025-03-11 Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant,…

Computation and Language · Computer Science 2026-04-21 Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with…

Computation and Language · Computer Science 2022-11-09 Oleksiy Syvokon, Olena Nahorna

Long Input Benchmark for Russian Analysis

Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with…

Computation and Language · Computer Science 2024-08-06 Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova

Investigating Bias In Automatic Toxic Comment Detection: An Empirical Study

With surge in online platforms, there has been an upsurge in the user engagement on these platforms via comments and reactions. A large portion of such textual comments are abusive, rude and offensive to the audience. With machine learning…

Computation and Language · Computer Science 2021-08-17 Ayush Kumar, Pratik Kumar

RuMedBench: A Russian Medical Language Understanding Benchmark

The paper describes the open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the…

Computation and Language · Computer Science 2022-07-14 Pavel Blinov, Arina Reshetnikova, Aleksandr Nesterov, Galina Zubkova, Vladimir Kokh

From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on…

Computation and Language · Computer Science 2026-03-09 Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation

As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech.…

Computation and Language · Computer Science 2025-07-17 Ziyu Ge, Gabriel Chua, Leanne Tan, Roy Ka-Wei Lee

When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content

Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection, disinformation, or misinformation identification. However, the intricacies of the content may have a detrimental effect on…

Computation and Language · Computer Science 2023-11-20 Stetsenko Daria

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the…

Computation and Language · Computer Science 2024-11-19 Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais