Related papers: Toxicity Classification in Ukrainian

200 papers

As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and…

Computation and Language · Computer Science 2026-03-20 Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour

Text classification is crucial for applications such as sentiment analysis and toxic text filtering, but it still faces challenges due to the complexity and ambiguity of natural language. Recent advancements in deep learning, particularly…

Computation and Language · Computer Science 2024-08-29 Lingyu Gao

Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which…

In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in…

Artificial Intelligence · Computer Science 2024-07-16 Krishanu Maity, A. S. Poornash, Sriparna Saha, Pushpak Bhattacharyya

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in…

Computation and Language · Computer Science 2021-09-30 Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad

Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training…

Computation and Language · Computer Science 2024-06-26 Nicholas Pangakis, Samuel Wolken

The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit toxic speech consists of coded or indirect language. Therefore, it is…

Computation and Language · Computer Science 2024-05-21 Nhat M. Hoang, Xuan Long Do, Duc Anh Do, Duc Anh Vu, Luu Anh Tuan

A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for…

Computation and Language · Computer Science 2023-08-04 Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez

The increasing accessibility of the internet facilitated social media usage and encouraged individuals to express their opinions liberally. Nevertheless, it also creates a place for content polluters to disseminate offensive posts or…

Computation and Language · Computer Science 2021-03-02 Omar Sharif, Eftekhar Hossain, Mohammed Moshiul Hoque

Objective: Causality mining is an active research area, which requires the application of state-of-the-art natural language processing techniques. In the healthcare domain, medical experts create clinical text to overcome the limitation of…

The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically…

Computation and Language · Computer Science 2025-11-19 Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty, Prasanna Pendse

Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The…

Software Engineering · Computer Science 2026-02-06 Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli

The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often…

Computation and Language · Computer Science 2025-09-26 Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich

We present a supervised learning algorithm for text categorization which has brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize…

Information Retrieval · Computer Science 2013-07-11 Hubert Haoyang Duan, Vladimir Pestov, Varun Singla

For high-resource languages like English, text classification is a well-studied task. The performance of modern NLP models easily achieves an accuracy of more than 90% in many standard datasets for text classification in English (Xie et…

Computation and Language · Computer Science 2022-06-06 Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, Dietrich Klakow

The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to…

Computation and Language · Computer Science 2022-01-19 Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on…

Computation and Language · Computer Science 2025-06-23 Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made…

Computation and Language · Computer Science 2020-07-31 Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo

Moderation is crucial to promoting healthy on-line discussions. Although several 'toxicity' detection datasets and models have been published, most of them ignore the context of the posts, implicitly assuming that comments may be judged…

Computation and Language · Computer Science 2020-06-02 John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

Despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans. Explanations are particularly important for tasks like offensive language or toxicity…

Computation and Language · Computer Science 2021-03-03 Tong Xiang, Sean MacAvaney, Eugene Yang, Nazli Goharian