Related papers: The Structural Safety Generalization Problem

200 papers

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we…

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields such as healthcare, software engineering, and conversational systems.…

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that…

Cryptography and Security · Computer Science 2025-10-17 Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the…

Machine Learning · Computer Science 2025-07-11 Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang

While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized…

Machine Learning · Computer Science 2025-08-12 Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition…

Machine Learning · Computer Science 2023-07-06 Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that…

Cryptography and Security · Computer Science 2024-01-31 Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue

Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. However, these models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection,…

Cryptography and Security · Computer Science 2024-09-10 Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific…

Cryptography and Security · Computer Science 2025-12-30 Yangyang Guo, Yangyan Li, Mohan Kankanhalli

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaks, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies…

Cryptography and Security · Computer Science 2026-01-01 Yuan Xin, Dingfan Chen, Linyi Yang, Michael Backes, Xiao Zhang

Large Language Models have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide…

Computation and Language · Computer Science 2025-12-16 Darpan Aswal, Céline Hudelot

Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among…

Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their…

Cryptography and Security · Computer Science 2025-02-21 Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei

Jailbreaking attacks on large language models pose a significant threat to AI safety by enabling the generation of harmful or restricted content. While prior work has explored both handcrafted and automated jailbreak strategies, the…

Cryptography and Security · Computer Science 2026-05-18 Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making it easier for them to bypass safety-aligned LLMs than…

Computation and Language · Computer Science 2026-05-05 Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, Jianfeng Gao

With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes…

Artificial Intelligence · Computer Science 2026-03-10 Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li

In deployment and application, large language models (LLMs) typically undergo safety alignment to prevent illegal and unethical outputs. However, the continuous advancement of jailbreak attack techniques, designed to bypass safety…

Cryptography and Security · Computer Science 2025-09-05 Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu

Pluralism alignment with AI has the sophisticated and necessary goal of creating AI that can coexist with and serve morally multifaceted humanity. Research towards pluralism alignment has many efforts in enhancing the learning of large…

Computation and Language · Computer Science 2026-04-21 Ying Su, Mingen Zheng, Weili Diao, Haoran Li

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential…

Cryptography and Security · Computer Science 2024-11-06 Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad

While Large Language Models (LLMs) have powerful capabilities, they remain vulnerable to jailbreak attacks, a critical barrier to their safe use in real-time web applications. Current commercial LLM providers deploy output guardrails to…

Cryptography and Security · Computer Science 2026-01-15 Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren