Related papers: Gradient-Based Language Model Red Teaming

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While…

Computation and Language · Computer Science 2023-11-15 Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts…

Computation and Language · Computer Science 2025-03-03 Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain

Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models

Ensuring safety of large language models (LLMs) is important. Red teaming--a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs--has emerged as a crucial safety evaluation method. Within…

Machine Learning · Computer Science 2025-06-10 Ren-Jian Wang, Ke Xue, Zeyu Qin, Ziniu Li, Sheng Tang, Hao-Tian Li, Shengcai Liu, Chao Qian

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual…

Cryptography and Security · Computer Science 2026-04-29 Zhang Wei, Hanxuan Chen, Peilu Hu, Zhenyuan Wei, Chenwei Liang, Jing Luo, Ziyi Ni, Hao Yan, Li Mei, Shengning Lang, Kuan Lu, Xi Xiao, Zhimo Han, Yijin Wang, Yichao Zhang, Chen Yang, Junfeng Hao, Jiayi Gu, Riyang Bao, Mu-Jiang-Shan Wang

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing…

Computation and Language · Computer Science 2023-11-14 Rishabh Bhardwaj, Soujanya Poria

Automated Progressive Red Teaming

Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly and lacks scalability. Automated red teaming…

Cryptography and Security · Computer Science 2024-12-24 Bojian Jiang, Yi Jing, Tianhao Shen, Tong Wu, Qing Yang, Deyi Xiong

Red Teaming Language Models with Language Models

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human…

Computation and Language · Computer Science 2022-02-08 Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving

Curiosity-driven Red-teaming for Large Language Models

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team…

Machine Learning · Computer Science 2024-03-01 Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the…

Machine Learning · Computer Science 2024-10-03 Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori

Multi-lingual Multi-turn Automated Red Teaming for LLMs

Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs…

Computation and Language · Computer Science 2025-04-07 Abhishek Singhania, Christophe Dupuy, Shivam Mangale, Amani Namboori

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We…

Computation and Language · Computer Science 2026-05-19 Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this…

Machine Learning · Computer Science 2025-01-15 Jonathan Nöther, Adish Singla, Goran Radanović

Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks

We consider the problem of red teaming LLMs on elementary calculations and algebraic tasks to evaluate how various prompting techniques affect the quality of outputs. We present a framework to procedurally generate numerical questions and…

Computation and Language · Computer Science 2024-01-02 Aleksander Buszydlik, Karol Dobiczek, Michał Teodor Okoń, Konrad Skublicki, Philip Lippmann, Jie Yang

Attack Prompt Generation for Red Teaming and Defending Large Language Models

Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on…

Computation and Language · Computer Science 2023-10-20 Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process…

Computation and Language · Computer Science 2024-08-21 Tej Deep Pala, Vernon Y. H. Toh, Rishabh Bhardwaj, Soujanya Poria

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Deploying large language models (LMs) can pose hazards from harmful outputs such as toxic or false text. Prior work has introduced automated tools that elicit harmful outputs to identify these risks. While this is a valuable step toward…

Computation and Language · Computer Science 2023-10-12 Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of…

Cryptography and Security · Computer Science 2024-10-14 Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

Red-Teaming for Inducing Societal Bias in Large Language Models

Ensuring the safe deployment of AI systems is critical in industry settings where biased outputs can lead to significant operational, reputational, and regulatory risks. Thorough evaluation before deployment is essential to prevent these…

Computation and Language · Computer Science 2025-05-23 Chu Fei Luo, Ahmad Ghawanmeh, Bharat Bhimshetty, Kashyap Murali, Murli Jadhav, Xiaodan Zhu, Faiza Khan Khattak

Embodied Red Teaming for Auditing Robotic Foundation Models

Language-conditioned robot models have the potential to enable robots to perform a wide range of tasks based on natural language instructions. However, assessing their safety and effectiveness remains challenging because it is difficult to…

Robotics · Computer Science 2025-02-11 Sathwik Karnik, Zhang-Wei Hong, Nishant Abhangi, Yen-Chen Lin, Tsun-Hsuan Wang, Christophe Dupuy, Rahul Gupta, Pulkit Agrawal

Query-Efficient Black-Box Red Teaming via Bayesian Optimization

The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and…

Artificial Intelligence · Computer Science 2023-05-30 Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song