Related papers: CMD: a framework for Context-aware Model self-Deto…

Language Model Detoxification in Dialogue with Contextualized Stance Control

To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a…

Computation and Language · Computer Science 2023-01-26 Jing Qian, Xifeng Yan

Self-Detoxifying Language Models via Toxification Reversal

Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs) for safer deployment. Existing methods can be roughly categorized as finetuning-based and…

Computation and Language · Computer Science 2023-10-17 Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li

Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model

Existing approaches for Large language model (LLM) detoxification generally rely on training on large-scale non-toxic or human-annotated preference data, designing prompts to instruct the LLM to generate safe content, or modifying the model…

Computation and Language · Computer Science 2025-06-03 Yuanhe Tian, Mingjie Deng, Guoqing Jin, Yan Song

Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely…

Computation and Language · Computer Science 2026-01-21 Kaituo Zhang, Zhimeng Jiang, Na Zou

Reward Modeling for Mitigating Toxicity in Transformer-based Language Models

Transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models that are pretrained on large unlabeled web text corpora have been shown…

Computation and Language · Computer Science 2022-07-28 Farshid Faal, Ketra Schmitt, Jia Yuan Yu

Leashing the Inner Demons: Self-Detoxification for Language Models

Language models (LMs) can reproduce (or amplify) toxic language seen during training, which poses a risk to their practical application. In this paper, we conduct extensive experiments to study this phenomenon. We analyze the impact of…

Computation and Language · Computer Science 2022-03-08 Canwen Xu, Zexue He, Zhankui He, Julian McAuley

DetoxLLM: A Framework for Detoxification with Explanations

Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset…

Machine Learning · Computer Science 2024-10-07 Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

Language Detoxification with Attribute-Discriminative Latent Space

Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks, but they can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications. To…

Computation and Language · Computer Science 2023-07-06 Jin Myung Kwak, Minseon Kim, Sung Ju Hwang

Detoxification for LLM: From Dataset Itself

Existing detoxification methods for large language models mainly focus on post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches…

Computation and Language · Computer Science 2026-04-22 Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

Pre-trained language models (LMs) are shown to easily generate toxic language. In this work, we systematically explore domain-adaptive training to reduce the toxicity of language models. We conduct this study on three dimensions: training…

Computation and Language · Computer Science 2022-10-25 Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro

Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models

Detoxification is the task of generating text in a polite style while preserving the meaning and fluency of the original toxic text. Existing detoxification methods are designed to work in a single language. This work investigates multilingual and…

Computation and Language · Computer Science 2022-06-07 Daniil Moskovskiy, Daryna Dementieva, Alexander Panchenko

DSCD: Large Language Model Detoxification with Self-Constrained Decoding

Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency.…

Computation and Language · Computer Science 2025-10-16 Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, Tingting He

CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation

We propose a method to control the attributes of Language Models (LMs) for the text generation task using Causal Average Treatment Effect (ATE) scores and counterfactual augmentation. We explore this method, in the context of LM…

Computation and Language · Computer Science 2023-10-04 Rahul Madhavan, Rishabh Garg, Kahini Wadhawan, Sameep Mehta

Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving…

Machine Learning · Computer Science 2025-07-08 Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li

Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a…

Computation and Language · Computer Science 2025-10-24 Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen

DiffuDetox: A Mixed Diffusion Model for Text Detoxification

Text detoxification is a conditional text generation task aiming to remove offensive content from toxic text. It is highly useful for online forums and social media, where offensive content is frequently encountered. Intuitively, there are…

Computation and Language · Computer Science 2023-06-16 Griffin Floto, Mohammad Mahdi Abdollah Pour, Parsa Farinneya, Zhenwei Tang, Ali Pesaranghader, Manasa Bharadwaj, Scott Sanner

Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification

Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in a monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for…

Computation and Language · Computer Science 2023-11-27 Daryna Dementieva, Daniil Moskovskiy, David Dale, Alexander Panchenko

Detoxifying Language Models Risks Marginalizing Minority Voices

Language models (LMs) must be both safe and equitable to be responsibly deployed in practice. With safety in mind, numerous detoxification techniques (e.g., Dathathri et al. 2020; Krause et al. 2020) have been proposed to mitigate toxic LM…

Computation and Language · Computer Science 2021-04-14 Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, Dan Klein

Test-Time Detoxification without Training or Learning Anything

Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful…

Computation and Language · Computer Science 2026-02-04 Baturay Saglam, Dionysis Kalogerias

Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

Due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model…

Computation and Language · Computer Science 2023-09-06 Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini