Related papers: Automated Harmfulness Testing for Code Large Langu…

Towards Safer Social Media Platforms: Scalable and Performant Few-Shot Harmful Content Moderation Using Large Language Models

The prevalence of harmful content on social media platforms poses significant risks to users and society, necessitating more effective and scalable content moderation strategies. Current approaches rely on human moderators, supervised…

Computation and Language · Computer Science · 2025-01-27 · Akash Bonagiri, Lucen Li, Rajvardhan Oak, Zeerak Babar, Magdalena Wojcieszak, Anshuman Chhabra

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content…

Computation and Language · Computer Science · 2025-08-14 · Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu

Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

Recent research has focused on using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite the growing interest in this area, these generated explanations' effectiveness and…

Computation and Language · Computer Science · 2023-08-31 · Han Wang, Ming Shan Hee, Md Rabiul Awal, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

The widespread use of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from factoid, unfair, and toxic content. Previous researchers have invested much effort…

Computation and Language · Computer Science · 2024-12-24 · Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu

Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for…

Computation and Language · Computer Science · 2025-08-14 · Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as edge devices, but with different propensity…

Computation and Language · Computer Science · 2025-04-22 · Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca J. Passonneau

Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms

Social media platforms utilize Machine Learning (ML) and Artificial Intelligence (AI) powered recommendation algorithms to maximize user engagement, which can result in inadvertent exposure to harmful content. Current moderation efforts,…

Computation and Language · Computer Science · 2025-05-30 · Rajvardhan Oak, Muhammad Haroon, Claire Jo, Magdalena Wojcieszak, Anshuman Chhabra

Metamorphic Malware Evolution: The Potential and Peril of Large Language Models

Code metamorphism refers to a computer programming exercise wherein the program modifies its own code (partial or entire) consistently and automatically while retaining its core functionality. This technique is often used for online…

Cryptography and Security · Computer Science · 2024-11-05 · Pooria Madani

Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation

Large Language Models (LLMs) have become powerful tools for automated code generation. However, these models often overlook critical security practices, which can result in the generation of insecure code that contains…

Software Engineering · Computer Science · 2025-07-01 · Hao Yan, Swapneel Suhas Vaidya, Xiaokuan Zhang, Ziyu Yao

Supporting Human Raters with the Detection of Harmful Content using Large Language Models

In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content including hate speech, harassment, violent extremism, and election…

Cryptography and Security · Computer Science · 2024-06-19 · Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaž Bratanič, Felipe Tiengo Ferreira, Vijay Kumar Eranti, Elie Bursztein

AI Content Moderation in Therapy Conversations

Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChatGPT or Llama are often developed with content moderation guardrails that…

Human-Computer Interaction · Computer Science · 2026-05-26 · Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha

Validating LLM-Generated Programs with Metamorphic Prompt Testing

The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown remarkable capacity to generate code…

Software Engineering · Computer Science · 2024-06-12 · Xiaoyin Wang, Dakai Zhu

Large Language Models for Automatic Detection of Sensitive Topics

Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus…

Computation and Language · Computer Science · 2024-09-04 · Ruoyu Wen, Stephanie Elena Crowe, Kunal Gupta, Xinyue Li, Mark Billinghurst, Simon Hoermann, Dwain Allan, Alaeddin Nassani, Thammathip Piumsomboon

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which…

Computation and Language · Computer Science · 2026-05-26 · Shaz Furniturewala, Arkaitz Zubiaga

Why Do Large Language Models Generate Harmful Content?

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors…

Artificial Intelligence · Computer Science · 2026-04-14 · Rajesh Ganguli, Raha Moraffah

Probing AI Safety with Source Code

Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with…

Computation and Language · Computer Science · 2025-06-26 · Ujwal Narayan, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik Narasimhan, Ameet Deshpande, Vishvak Murahari

MTTM: Metamorphic Testing for Textual Content Moderation Software

The exponential growth of social media platforms such as Twitter and Facebook has revolutionized textual communication and textual content publication in human society. However, they have been increasingly exploited to propagate toxic…

Computation and Language · Computer Science · 2023-02-14 · Wenxuan Wang, Jen-tse Huang, Weibin Wu, Jianping Zhang, Yizhan Huang, Shuqing Li, Pinjia He, Michael Lyu

LLM-based Semantic Augmentation for Harmful Content Detection

Recent advances in large language models (LLMs) have demonstrated strong performance on simple text classification tasks, frequently under zero-shot settings. However, their efficacy declines when tackling complex social media challenges…

Computation and Language · Computer Science · 2025-04-23 · Elyas Meguellati, Assaad Zeghina, Shazia Sadiq, Gianluca Demartini

Making Harmful Behaviors Unlearnable for Large Language Models

Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. To meet the requirements of different applications, LLMs are often customized by further fine-tuning. However, the powerful…

Machine Learning · Computer Science · 2023-11-07 · Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang

Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency…

Computation and Language · Computer Science · 2026-04-09 · Shutong Zhang, Dylan Zhou, Yinxiao Liu, Yang Yang, Huiwen Luo, Wenfei Zou