Related papers: Toxicity Classification in Ukrainian

200 papers

Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential…

Online media aim to reach an ever bigger audience and to attract an ever longer attention span. This competition creates an environment that rewards sensational, fake, and toxic news. To help limit their spread and impact, we propose and…

Computation and Language · Computer Science 2019-08-27 Yoan Dinkov, Ivan Koychev, Preslav Nakov

Toxicity has become a grave problem for many online communities and has been growing across many languages, including Russian. Hate speech creates an environment of intimidation, discrimination, and may even incite some real-world violence.…

Computation and Language · Computer Science 2020-10-23 Nadezhda Zueva, Madina Kabirova, Pavel Kalaidin

The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer…

Computation and Language · Computer Science 2024-08-09 François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

Cross-lingual annotations of legislative texts enable us to explore major themes covered in multilingual legal data and are a key facilitator of semantic similarity when searching for similar documents. Multilingual probabilistic topic…

Information Retrieval · Computer Science 2019-12-02 Carlos Badenes-Olmedo, Jose-Luis Redondo-Garcia, Oscar Corcho

Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each…

Machine Learning · Computer Science 2021-09-22 Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing tasks requiring bilingual data. This research proposes a language independent bi-sentence…

Computation and Language · Computer Science 2015-10-16 Krzysztof Wołk

Cross-Language Information Retrieval (CLIR) and machine translation (MT) resources, such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Besides, these resources are just limited to a few languages,…

Computation and Language · Computer Science 2013-02-20 Sa Liu, Chengzhi Zhang

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which…

Computation and Language · Computer Science 2026-05-26 Shaz Furniturewala, Arkaitz Zubiaga

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap,…

Computation and Language · Computer Science 2024-05-31 Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For…

Computation and Language · Computer Science 2025-03-11 Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant,…

Computation and Language · Computer Science 2026-04-21 Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu

We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with…

Computation and Language · Computer Science 2022-11-09 Oleksiy Syvokon, Olena Nahorna

Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with…

Computation and Language · Computer Science 2024-08-06 Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova

With the surge in online platforms, there has been an upsurge in user engagement on these platforms via comments and reactions. A large portion of such textual comments are abusive, rude and offensive to the audience. With machine learning…

Computation and Language · Computer Science 2021-08-17 Ayush Kumar, Pratik Kumar

The paper describes the open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the…

Computation and Language · Computer Science 2022-07-14 Pavel Blinov, Arina Reshetnikova, Aleksandr Nesterov, Galina Zubkova, Vladimir Kokh

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on…

Computation and Language · Computer Science 2026-03-09 Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech.…

Computation and Language · Computer Science 2025-07-17 Ziyu Ge, Gabriel Chua, Leanne Tan, Roy Ka-Wei Lee

Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection, disinformation, or misinformation identification. However, the intricacies of the content may have a detrimental effect on…

Computation and Language · Computer Science 2023-11-20 Stetsenko Daria

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the…

Computation and Language · Computer Science 2024-11-19 Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais