Related papers: Bergeron: Combating Adversarial Attacks through a …

Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies…

Cryptography and Security · Computer Science 2025-06-02 Jianwei Li, Jung-Eun Kim

A Framework for Rapidly Developing and Deploying Protection Against Large Language Model Attacks

The widespread adoption of Large Language Models (LLMs) has revolutionized AI deployment, enabling autonomous and semi-autonomous applications across industries through intuitive language interfaces and continuous improvements in model…

Cryptography and Security · Computer Science 2025-10-20 Adam Swanda, Amy Chang, Alexander Chen, Fraser Burch, Paul Kassianik, Konstantin Berlin

Adversarial Attacks on Large Language Models Using Regularized Relaxation

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to…

Machine Learning · Computer Science 2024-10-28 Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the…

Computation and Language · Computer Science 2023-10-18 Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh

Advancing NLP Security by Leveraging LLMs as Adversarial Engines

This position paper proposes a novel approach to advancing NLP security by leveraging Large Language Models (LLMs) as engines for generating diverse adversarial attacks. Building upon recent work demonstrating LLMs' effectiveness in…

Artificial Intelligence · Computer Science 2024-10-25 Sudarshan Srinivasan, Maria Mahbub, Amir Sadovnik

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems.…

Cryptography and Security · Computer Science 2026-05-29 Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models

Large Language Models (LLMs) have become a cornerstone in the field of Natural Language Processing (NLP), offering transformative capabilities in understanding and generating human-like text. However, with their rising prominence, the…

Cryptography and Security · Computer Science 2024-03-26 Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we…

Computation and Language · Computer Science 2025-12-19 Stephen Obadinma, Xiaodan Zhu

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single,…

Cryptography and Security · Computer Science 2026-05-20 Ye Sun, Xin Wang, Jiaming Zhang, Yifeng Gao, Yixu Wang, Yifan Ding, Qixian Zhang, Henghui Ding, Xingjun Ma, Yu-Gang Jiang

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content.…

Computation and Language · Computer Science 2024-06-13 Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks, but their safety and morality remain contentious due to their training on internet text corpora. To address these concerns, alignment…

Computation and Language · Computer Science 2024-08-06 Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad

Attack and defense techniques in large language models: A survey and new perspectives

Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack…

Cryptography and Security · Computer Science 2025-05-05 Zhiyu Liao, Kang Chen, Yuanguo Lin, Kangkang Li, Yunxuan Liu, Hefeng Chen, Xingwang Huang, Yuanhui Yu

Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial…

Computation and Language · Computer Science 2026-01-23 Yuan Gao, Zhigang Liu, Xinyu Yao, Bo Chen, Xiaobing Zhao

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. Here, one major impediment has been the overestimation of the robustness of new defense…

Artificial Intelligence · Computer Science 2023-10-31 Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel

Normative Conflicts and Shallow AI Alignment

The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are…

Computation and Language · Computer Science 2025-06-06 Raphaël Millière

Recent Advances in Attack and Defense Approaches of Large Language Models

Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities. However, their widespread deployment has raised significant safety and…

Cryptography and Security · Computer Science 2024-12-03 Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Warning: This paper contains examples of harmful language, and reader discretion is recommended. The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the…

Computation and Language · Computer Science 2023-10-05 Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin

Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs

This paper presents an approach to developing assurance cases for adversarial robustness and regulatory compliance in large language models (LLMs). Focusing on both natural and code language tasks, we explore the vulnerabilities these…

Cryptography and Security · Computer Science 2024-10-10 Tomas Bueno Momcilovic, Dian Balta, Beat Buesser, Giulio Zizzo, Mark Purcell

XBreaking: Understanding how LLMs security alignment can be broken

Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government…

Cryptography and Security · Computer Science 2025-11-10 Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Vinod P

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training…

Cryptography and Security · Computer Science 2024-06-03 Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi