Related papers: Toxicity Classification in Ukrainian

UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data

Prior works in cross-lingual named entity recognition (NER) with no/little labeled data fall into two primary categories: model transfer based and data transfer based methods. In this paper we find that both method types can complement each…

Computation and Language · Computer Science 2020-07-16 Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, Jian-Guang Lou

Unveiling the Implicit Toxicity in Large Language Models

The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be…

Computation and Language · Computer Science 2023-11-30 Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the…

Computation and Language · Computer Science 2022-02-22 Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a…

Computation and Language · Computer Science 2024-11-25 Yurii Paniv, Artur Kiulian, Dmytro Chaplynskyi, Mykola Khandoga, Anton Polishko, Tetiana Bas, Guillermo Gabrielli

Empirical Analysis of Multi-Task Learning for Reducing Model Bias in Toxic Comment Detection

With the recent rise of toxicity in online conversations on social media platforms, using modern machine learning algorithms for toxic comment detection has become a central focus of many online applications. Researchers and companies have…

Artificial Intelligence · Computer Science 2020-03-30 Ameya Vaidya, Feng Mai, Yue Ning

Towards non-toxic landscapes: Automatic toxic comment detection using DNN

The spectacular expansion of the Internet has led to the development of a new research problem in the field of natural language processing: automatic toxic comment detection, since many countries prohibit hate speech in public media. There…

Machine Learning · Computer Science 2020-09-18 Ashwin Geet D'Sa, Irina Illina, Dominique Fohr

Neural Cross-Lingual Named Entity Recognition with Minimal Resources

For languages with no annotated resources, unsupervised transfer of natural language processing models such as named-entity recognition (NER) from resource-rich languages would be an appealing capability. However, differences in words and…

Computation and Language · Computer Science 2018-09-13 Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, Jaime Carbonell

Be My Donor. Transfer the NLP Datasets Between the Languages Using LLM

In this work, we investigated how one can use the LLM to transfer the dataset and its annotation from one language to another. This is crucial since sharing the knowledge between different languages could boost certain underresourced…

Computation and Language · Computer Science 2024-10-21 Dmitrii Popov, Egor Terentev, Igor Buyanov

Toxicity Detection towards Adaptability to Changing Perturbations

Toxicity detection is crucial for maintaining the peace of the society. While existing methods perform well on normal toxic contents or those generated by specific perturbation methods, they are vulnerable to evolving perturbation patterns.…

Cryptography and Security · Computer Science 2025-03-05 Hankun Kang, Jianhao Chen, Yongqi Li, Xin Miao, Mayi Xu, Ming Zhong, Yuanyuan Zhu, Tieyun Qian

From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of…

Computation and Language · Computer Science 2024-04-16 Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, Raghav Ravishankar, Adarsh Shirawalmath

Labeling Free-text Data using Language Model Ensembles

Free-text responses are commonly collected in psychological studies, providing rich qualitative insights that quantitative measures may not capture. Labeling curated topics of research interest in free-text data by multiple trained human…

Computation and Language · Computer Science 2025-09-29 Jiaxing Qiu, Dongliang Guo, Natalie Papini, Noelle Peace, Hannah F. Fitterman-Harris, Cheri A. Levinson, Tom Hartvigsen, Teague R. Henry

Cross-lingual Candidate Search for Biomedical Concept Normalization

Biomedical concept normalization links concept mentions in texts to a semantically equivalent concept in a biomedical knowledge base. This task is challenging as concepts can have different expressions in natural languages, e.g.…

Computation and Language · Computer Science 2018-07-10 Roland Roller, Madeleine Kittner, Dirk Weissenborn, Ulf Leser

ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for…

Computation and Language · Computer Science 2026-04-15 Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou

Aligned Probing: Relating Toxic Behavior and Model Internals

We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo,…

Computation and Language · Computer Science 2025-09-25 Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych

Large Language Models for Healthcare Text Classification: A Systematic Review

Large Language Models (LLMs) have fundamentally transformed approaches to Natural Language Processing (NLP) tasks across diverse domains. In healthcare, accurate and cost-efficient text classification is crucial, whether for clinical notes…

Computation and Language · Computer Science 2026-02-16 Hajar Sakai, Sarah S. Lam

Fake news detection is a challenging task aiming to reduce human time and effort to check the truthfulness of news. Automated approaches to combat fake news, however, are limited by the lack of labeled benchmark datasets, especially in…

Computation and Language · Computer Science 2021-03-02 Inna Vogel, Jeong-Eun Choi, Meghana Meghana

Method of the coherence evaluation of Ukrainian text

Due to the growing role of the SEO technologies, it is necessary to perform an automated analysis of the article's quality. Such approach helps both to return the most intelligible pages for the user's query and to raise the web sites…

Computation and Language · Computer Science 2020-11-03 S. D. Pogorilyy, A. A. Kramov

Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph

The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific…

Computation and Language · Computer Science 2025-06-03 Yibo Zhao, Jiapeng Zhu, Can Xu, Yao Liu, Xiang Li

FreeTransfer-X: Safe and Label-Free Cross-Lingual Transfer from Off-the-Shelf Models

Cross-lingual transfer (CLT) is of various applications. However, labeled cross-lingual corpus is expensive or even inaccessible, especially in the fields where labels are private, such as diagnostic results of symptoms in medicine and user…

Computation and Language · Computer Science 2022-06-15 Yinpeng Guo, Liangyou Li, Xin Jiang, Qun Liu

An Experimental Comparison of the Most Popular Approaches to Fake News Detection

In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven…

Computation and Language · Computer Science 2026-03-27 Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro