Related papers: Toxicity Classification in Ukrainian

Leashing the Inner Demons: Self-Detoxification for Language Models

Language models (LMs) can reproduce (or amplify) toxic language seen during training, which poses a risk to their practical application. In this paper, we conduct extensive experiments to study this phenomenon. We analyze the impact of…

Computation and Language · Computer Science 2022-03-08 Canwen Xu, Zexue He, Zhankui He, Julian McAuley

Challenges in Automated Debiasing for Toxic Language Detection

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text…

Computation and Language · Computer Science 2021-02-02 Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, Yejin Choi

Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization

Large language models (LLMs) have exhibited considerable cross-lingual generalization abilities, whereby they implicitly transfer knowledge across languages. However, the transfer is not equally successful for all languages, especially for…

Computation and Language · Computer Science 2023-12-25 Ningyu Xu, Qi Zhang, Jingting Ye, Menghan Zhang, Xuanjing Huang

Learning Crosslingual Word Embeddings without Bilingual Corpora

Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools. However, previous attempts had expensive resource requirements, difficulty incorporating monolingual…

Computation and Language · Computer Science 2016-07-01 Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, Trevor Cohn

Cross-Domain Toxic Spans Detection

Given the dynamic nature of toxic language use, automated methods for detecting toxic spans are likely to encounter distributional shift. To explore this phenomenon, we evaluate three approaches for detecting toxic spans under cross-domain…

Computation and Language · Computer Science 2023-06-19 Stefan F. Schouten, Baran Barbarestani, Wondimagegnhue Tufa, Piek Vossen, Ilia Markov

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional…

Computation and Language · Computer Science 2026-05-29 Volodymyr Ovcharov

Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes

Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer…

Computation and Language · Computer Science 2022-11-10 Louis Clouâtre, Prasanna Parthasarathi, Amal Zouaq, Sarath Chandar

Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering

Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is…

Computation and Language · Computer Science 2025-11-04 Vlad Negoita, Mihai Masala, Traian Rebedea

Revisiting Contextual Toxicity Detection in Conversations

Understanding toxicity in user conversations is undoubtedly an important problem. Addressing "covert" or implicit cases of toxicity is particularly hard and requires context. Very few previous studies have analysed the influence of…

Computation and Language · Computer Science 2022-10-19 Atijit Anuchitanukul, Julia Ive, Lucia Specia

Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning

In many multilingual text classification problems, the documents in different languages often share the same set of categories. To reduce the labeling cost of training a classification model for each individual language, it is important to…

Computation and Language · Computer Science 2012-07-03 Yuhong Guo, Min Xiao

Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data

In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and…

Computation and Language · Computer Science 2025-05-16 Poli Apollinaire Nemkova, Solomon Ubani, Mark V. Albert

A Survey of Toxic Comment Classification Methods

While in real life everyone behaves themselves at least to some extent, it is much more difficult to expect people to behave themselves on the internet, because there are few checks or consequences for posting something toxic to others.…

Computation and Language · Computer Science 2021-12-14 Kehan Wang, Jiaxi Yang, Hongjun Wu

UTNLP at SemEval-2021 Task 5: A Comparative Analysis of Toxic Span Detection using Attention-based, Named Entity Recognition, and Ensemble Models

Detecting which parts of a sentence contribute to that sentence's toxicity -- rather than providing a sentence-level verdict of hatefulness -- would increase the interpretability of models and allow human moderators to better understand the…

Computation and Language · Computer Science 2021-04-13 Alireza Salemi, Nazanin Sabri, Emad Kebriaei, Behnam Bahrak, Azadeh Shakery

Modeling subjectivity (by Mimicking Annotator Annotation) in toxic comment identification across diverse communities

The prevalence and impact of toxic discussions online have made content moderation crucial. Automated systems can play a vital role in identifying toxicity, and reducing the reliance on human moderation. Nevertheless, identifying toxic…

Artificial Intelligence · Computer Science 2023-11-02 Senjuti Dutta, Sid Mittal, Sherol Chen, Deepak Ramachandran, Ravi Rajakumar, Ian Kivlichan, Sunny Mak, Alena Butryna, Praveen Paritosh

Unsupervised Cross-lingual Transfer of Word Embedding Spaces

Cross-lingual transfer of word embeddings aims to establish the semantic mappings among words in different languages by learning the transformation functions over the corresponding word embedding spaces. Successfully solving this problem…

Computation and Language · Computer Science 2018-09-12 Ruochen Xu, Yiming Yang, Naoki Otani, Yuexin Wu

Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

Toxicity identification in online multimodal environments remains a challenging task due to the complexity of contextual connections across modalities (e.g., textual and visual). In this paper, we propose a novel framework that integrates…

Machine Learning · Computer Science 2026-02-18 Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru

Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary

Cross-lingual model transfer is a compelling and popular method for predicting annotations in a low-resource language, whereby parallel corpora provide a bridge to a high-resource language and its associated annotated corpora. However,…

Computation and Language · Computer Science 2017-05-02 Meng Fang, Trevor Cohn

Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus

Many efforts of research are devoted to semantic role labeling (SRL) which is crucial for natural language understanding. Supervised approaches have achieved impressing performances when large-scale corpora are available for resource-rich…

Computation and Language · Computer Science 2020-05-08 Hao Fei, Meishan Zhang, Donghong Ji

A Curriculum Learning Approach for Multi-domain Text Classification Using Keyword weight Ranking

Text classification is a very classic NLP task, but it has two prominent shortcomings: On the one hand, text classification is deeply domain-dependent. That is, a classifier trained on the corpus of one domain may not perform so well in…

Computation and Language · Computer Science 2022-10-28 Zilin Yuan, Yinghui Li, Yangning Li, Rui Xie, Wei Wu, Hai-Tao Zheng

ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer

To achieve equitable performance across languages, large language models (LLMs) must be able to abstract knowledge beyond the language in which it was learnt. However, the current literature lacks reliable ways to measure LLMs' capability…

Computation and Language · Computer Science 2025-11-11 Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, Matan Eyal