Related papers: Toxicity Classification in Ukrainian

Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and…

Computation and Language · Computer Science 2026-03-20 Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour

Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings

Text classification is crucial for applications such as sentiment analysis and toxic text filtering, but it still faces challenges due to the complexity and ambiguity of natural language. Recent advancements in deep learning, particularly…

Computation and Language · Computer Science 2024-08-29 Lingyu Gao

On the Role of Speech Data in Reducing Toxicity Detection Bias

Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which…

Computation and Language · Computer Science 2025-05-19 Samuel J. Bell, Mariano Coria Meglioli, Megan Richards, Eduardo Sánchez, Christophe Ropers, Skyler Wang, Adina Williams, Levent Sagun, Marta R. Costa-jussà

ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos

In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in…

Artificial Intelligence · Computer Science 2024-07-16 Krishanu Maity, A. S. Poornash, Sriparna Saha, Pushpak Bhattacharyya

Evaluating Various Tokenizers for Arabic Text Classification

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in…

Computation and Language · Computer Science 2021-09-30 Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad

Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training…

Computation and Language · Computer Science 2024-06-26 Nicholas Pangakis, Samuel Wolken

ToXCL: A Unified Framework for Toxic Speech Detection and Explanation

The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit toxic speech consists of coded or indirect language. Therefore, it is…

Computation and Language · Computer Science 2024-05-21 Nhat M. Hoang, Xuan Long Do, Duc Anh Do, Duc Anh Vu, Luu Anh Tuan

SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text

A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for…

Computation and Language · Computer Science 2023-08-04 Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez

NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from Multilingual Code-Mixed Text using Transformers

The increasing accessibility of the internet facilitated social media usage and encouraged individuals to express their opinions liberally. Nevertheless, it also creates a place for content polluters to disseminate offensive posts or…

Computation and Language · Computer Science 2021-03-02 Omar Sharif, Eftekhar Hossain, Mohammed Moshiul Hoque

A Practical Approach towards Causality Mining in Clinical Text using Active Transfer Learning

Objective: Causality mining is an active research area, which requires the application of state-of-the-art natural language processing techniques. In the healthcare domain, medical experts create clinical text to overcome the limitation of…

Computation and Language · Computer Science 2021-10-13 Musarrat Hussain, Fahad Ahmed Satti, Jamil Hussain, Taqdir Ali, Syed Imran Ali, Hafiz Syed Muhammad Bilal, Gwang Hoon Park, Sungyoung Lee

The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically…

Computation and Language · Computer Science 2025-11-19 Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty, Prasanna Pendse

How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The…

Software Engineering · Computer Science 2026-02-06 Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli

LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often…

Computation and Language · Computer Science 2025-09-26 Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich

Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

We present a supervised learning algorithm for text categorization which has brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize…

Information Retrieval · Computer Science 2013-07-11 Hubert Haoyang Duan, Vladimir Pestov, Varun Singla

Task-Adaptive Pre-Training for Boosting Learning With Noisy Labels: A Study on Text Classification for African Languages

For high-resource languages like English, text classification is a well-studied task. The performance of modern NLP models easily achieves an accuracy of more than 90% in many standard datasets for text classification in English (Xie et…

Computation and Language · Computer Science 2022-06-06 Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, Dietrich Klakow

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to…

Computation and Language · Computer Science 2022-01-19 Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

LogProber: Disentangling confidence from contamination in LLM responses

In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on…

Computation and Language · Computer Science 2025-06-23 Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made…

Computation and Language · Computer Science 2020-07-31 Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo

Toxicity Detection: Does Context Really Matter?

Moderation is crucial to promoting healthy on-line discussions. Although several 'toxicity' detection datasets and models have been published, most of them ignore the context of the posts, implicitly assuming that comments may be judged…

Computation and Language · Computer Science 2020-06-02 John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

ToxCCIn: Toxic Content Classification with Interpretability

Despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans. Explanations are particularly important for tasks like offensive language or toxicity…

Computation and Language · Computer Science 2021-03-03 Tong Xiang, Sean MacAvaney, Eugene Yang, Nazli Goharian