Related papers: Gradient-Based Language Model Red Teaming

200 papers

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While…

Computation and Language · Computer Science · 2023-11-15 · Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts…

Computation and Language · Computer Science · 2025-03-03 · Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain

Ensuring safety of large language models (LLMs) is important. Red teaming--a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs--has emerged as a crucial safety evaluation method. Within…

Machine Learning · Computer Science · 2025-06-10 · Ren-Jian Wang, Ke Xue, Zeyu Qin, Ziniu Li, Sheng Tang, Hao-Tian Li, Shengcai Liu, Chao Qian

The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual…

Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing…

Computation and Language · Computer Science · 2023-11-14 · Rishabh Bhardwaj, Soujanya Poria

Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly and lacks scalability. Automated red teaming…

Cryptography and Security · Computer Science · 2024-12-24 · Bojian Jiang, Yi Jing, Tianhao Shen, Tong Wu, Qing Yang, Deyi Xiong

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human…

Computation and Language · Computer Science · 2022-02-08 · Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team…

Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the…

Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs…

Computation and Language · Computer Science · 2025-04-07 · Abhishek Singhania, Christophe Dupuy, Shivam Mangale, Amani Namboori

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We…

Computation and Language · Computer Science · 2026-05-19 · Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this…

Machine Learning · Computer Science · 2025-01-15 · Jonathan Nöther, Adish Singla, Goran Radanović

We consider the problem of red teaming LLMs on elementary calculations and algebraic tasks to evaluate how various prompting techniques affect the quality of outputs. We present a framework to procedurally generate numerical questions and…

Computation and Language · Computer Science · 2024-01-02 · Aleksander Buszydlik, Karol Dobiczek, Michał Teodor Okoń, Konrad Skublicki, Philip Lippmann, Jie Yang

Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on…

Computation and Language · Computer Science · 2023-10-20 · Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He

In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process…

Computation and Language · Computer Science · 2024-08-21 · Tej Deep Pala, Vernon Y. H. Toh, Rishabh Bhardwaj, Soujanya Poria

Deploying large language models (LMs) can pose hazards from harmful outputs such as toxic or false text. Prior work has introduced automated tools that elicit harmful outputs to identify these risks. While this is a valuable step toward…

Computation and Language · Computer Science · 2023-10-12 · Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of…

Cryptography and Security · Computer Science · 2024-10-14 · Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

Ensuring the safe deployment of AI systems is critical in industry settings where biased outputs can lead to significant operational, reputational, and regulatory risks. Thorough evaluation before deployment is essential to prevent these…

Computation and Language · Computer Science · 2025-05-23 · Chu Fei Luo, Ahmad Ghawanmeh, Bharat Bhimshetty, Kashyap Murali, Murli Jadhav, Xiaodan Zhu, Faiza Khan Khattak

Language-conditioned robot models have the potential to enable robots to perform a wide range of tasks based on natural language instructions. However, assessing their safety and effectiveness remains challenging because it is difficult to…

The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and…

Artificial Intelligence · Computer Science · 2023-05-30 · Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song