Related papers: The Structural Safety Generalization Problem

Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we…

Cryptography and Security · Computer Science 2024-09-04 Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems.…

Cryptography and Security · Computer Science 2026-05-29 Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that…

Cryptography and Security · Computer Science 2025-10-17 Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the…

Machine Learning · Computer Science 2025-07-11 Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang

Multi-Turn Jailbreaks Are Simpler Than They Seem

While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized…

Machine Learning · Computer Science 2025-08-12 Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz

Jailbroken: How Does LLM Safety Training Fail?

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition…

Machine Learning · Computer Science 2023-07-06 Alexander Wei, Nika Haghtalab, Jacob Steinhardt

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including the risk of 'jailbreak' attacks that…

Cryptography and Security · Computer Science 2024-01-31 Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue

Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes

Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. However, these models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection,…

Cryptography and Security · Computer Science 2024-09-10 Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi

Involuntary Jailbreak: On Self-Prompting Attacks

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific…

Cryptography and Security · Computer Science 2025-12-30 Yangyang Guo, Yangyan Li, Mohan Kankanhalli

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaking, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies…

Cryptography and Security · Computer Science 2026-01-01 Yuan Xin, Dingfan Chen, Linyi Yang, Michael Backes, Xiao Zhang

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Large Language Models have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide…

Computation and Language · Computer Science 2025-12-16 Darpan Aswal, Céline Hudelot

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among…

Computation and Language · Computer Science 2024-03-20 Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang

How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their…

Cryptography and Security · Computer Science 2025-02-21 Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Jailbreaking attacks on large language models pose a significant threat to AI safety by enabling the generation of harmful or restricted content. While prior work has explored both handcrafted and automated jailbreak strategies, the…

Cryptography and Security · Computer Science 2026-05-18 Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them easier to bypass safety-aligned LLM than…

Computation and Language · Computer Science 2026-05-05 Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, Jianfeng Gao

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes…

Artificial Intelligence · Computer Science 2026-03-10 Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

In deployment and application, large language models (LLMs) typically undergo safety alignment to prevent illegal and unethical outputs. However, the continuous advancement of jailbreak attack techniques, designed to bypass safety…

Cryptography and Security · Computer Science 2025-09-05 Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu

Jailbreaking Large Language Models with Morality Attacks

Pluralism alignment with AI has the sophisticated and necessary goal of creating AI that can coexist with and serve morally multifaceted humanity. Research towards pluralism alignment has many efforts in enhancing the learning of large…

Computation and Language · Computer Science 2026-04-21 Ying Su, Mingen Zheng, Weili Diao, Haoran Li

Jailbreaking Large Language Models with Symbolic Mathematics

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential…

Cryptography and Security · Computer Science 2024-11-06 Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

While Large Language Models (LLMs) have powerful capabilities, they remain vulnerable to jailbreak attacks, which is a critical barrier to their safe web real-time application. Current commercial LLM providers deploy output guardrails to…

Cryptography and Security · Computer Science 2026-01-15 Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren