Related papers: Toxicity Classification in Ukrainian

200 papers

Generic 'toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. We consider the challenge of measuring misogyny in natural language…

Computation and Language · Computer Science 2023-12-07 Aaron J. Snoswell , Lucinda Nelson , Hao Xue , Flora D. Salim , Nicolas Suzor , Jean Burgess

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for…

Computation and Language · Computer Science 2024-04-05 Chunyuan Deng , Yilun Zhao , Xiangru Tang , Mark Gerstein , Arman Cohan

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC…

Computation and Language · Computer Science 2021-09-08 Ilias Chalkidis , Manos Fergadiotis , Ion Androutsopoulos

As language models (LMs) deliver increasing performance on a range of NLP tasks, probing classifiers have become an indispensable technique in the effort to better understand their inner workings. A typical setup involves (1) defining an…

Computation and Language · Computer Science 2024-08-01 Charles Jin , Martin Rinard

In this study we address the problem of automated word stress detection in Russian using character-level models and no part-of-speech taggers. We use a simple bidirectional RNN with LSTM nodes and achieve an accuracy of 90% or higher. We…

Computation and Language · Computer Science 2019-07-15 Maria Ponomareva , Kirill Milintsevich , Ekaterina Chernyak , Anatoly Starostin

Online platforms have become an increasingly prominent means of communication. Despite the obvious benefits to the expanded distribution of content, the last decade has resulted in disturbing toxic communication, such as cyberbullying and…

Social and Information Networks · Computer Science 2023-09-04 Amit Sheth , Valerie L. Shalin , Ugur Kursuncu

Not all topics are equally "flammable" in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that…

Computation and Language · Computer Science 2021-03-10 Nikolay Babakov , Varvara Logacheva , Olga Kozlova , Nikita Semenov , Alexander Panchenko

Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels.…

Machine Learning · Computer Science 2021-10-05 Neel Nanda , Jonathan Uesato , Sven Gowal

Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for…

Computation and Language · Computer Science 2026-05-26 Volodymyr Ovcharov

Conversational data is essential in psychology because it can help researchers understand individuals' cognitive processes, emotions, and behaviors. Utterance labelling is a common strategy for analyzing this type of data. The development of…

Computation and Language · Computer Science 2022-08-16 Maria Laricheva , Chiyu Zhang , Yan Liu , Guanyu Chen , Terence Tracey , Richard Young , Giuseppe Carenini

The automated detection of hallucinations and training data contamination is pivotal to the safe deployment of Large Language Models (LLMs). These tasks are particularly challenging in settings where no access to model internals is…

Machine Learning · Computer Science 2025-10-01 Guy Bar-Shalom , Fabrizio Frasca , Derek Lim , Yoav Gelberg , Yftah Ziser , Ran El-Yaniv , Gal Chechik , Haggai Maron

This article examines semantic shifts in psychological concepts across scientific and popular media discourse using methods of distributional semantics applied to Russian-language corpora. Two corpora were compiled: a scientific corpus of…

Computation and Language · Computer Science 2026-04-02 Orlova Anastasia

Most existing approaches to disfluency detection rely heavily on human-annotated corpora, which are expensive to obtain in practice. There have been several proposals to alleviate this issue with, for instance, self-supervised learning…

Computation and Language · Computer Science 2020-10-30 Shaolei Wang , Zhongyuan Wang , Wanxiang Che , Ting Liu

Extremist groups develop complex in-group language, also referred to as cryptolects, to exclude or mislead outsiders. We investigate the ability of current language technologies to detect and interpret the cryptolects of two online…

Computation and Language · Computer Science 2025-06-09 Christine de Kock , Arij Riabi , Zeerak Talat , Michael Sejr Schlichtkrull , Pranava Madhyastha , Ed Hovy

Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study addresses this critical gap by investigating automatic text detoxification (toxic to neutral…

Computation and Language · Computer Science 2026-01-12 Abayomi O. Agbeyangi

Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset…

Machine Learning · Computer Science 2024-10-07 Md Tawkat Islam Khondaker , Muhammad Abdul-Mageed , Laks V. S. Lakshmanan

Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be overcautious and refuse benign examples. In addition, state-of-the-art toxicity…

Computation and Language · Computer Science 2024-11-11 Zhanhao Hu , Julien Piet , Geng Zhao , Jiantao Jiao , David Wagner

In this paper, we propose a model-agnostic cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new…

Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful…

Computation and Language · Computer Science 2026-02-04 Baturay Saglam , Dionysis Kalogerias

Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we…

Information Retrieval · Computer Science 2018-04-13 Mitodru Niyogi , Kripabandhu Ghosh , Arnab Bhattacharya