Related papers: Adversarial Tokenization

Rethinking Textual Adversarial Defense for Pre-trained Language Models

Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations on different levels…

Computation and Language · Computer Science 2022-08-23 Jiayi Wang, Rongzhou Bao, Zhuosheng Zhang, Hai Zhao

LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples

Large Language Models (LLMs), including GPT-3.5, LLaMA, and PaLM, seem to be knowledgeable and able to adapt to many tasks. However, we still cannot completely trust their answers, since LLMs suffer from hallucination…

Computation and Language · Computer Science 2024-08-06 Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, Li Yuan

Adversarial Attacks on Large Language Models Using Regularized Relaxation

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to…

Machine Learning · Computer Science 2024-10-28 Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search adversarial prompts that attempt to bypass…

Artificial Intelligence · Computer Science 2024-11-27 Yimo Deng, Huangxun Chen

Token-Modification Adversarial Attacks for Natural Language Processing: A Survey

Many adversarial attacks target natural language processing systems, most of which succeed through modifying the individual tokens of a document. Despite the apparent uniqueness of each of these attacks, fundamentally they are simply a…

Computation and Language · Computer Science 2024-01-09 Tom Roth, Yansong Gao, Alsharif Abuadbba, Surya Nepal, Wei Liu

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Although safely enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly the unknown jailbreak attack. To…

Computation and Language · Computer Science 2024-06-12 Fan Liu, Zhao Xu, Hao Liu

Certifying LLM Safety against Adversarial Prompting

Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check,…

Computation and Language · Computer Science 2025-02-06 Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju

Adversarial Neural Networks for Cross-lingual Sequence Tagging

We study cross-lingual sequence tagging with little or no labeled data in the target language. Adversarial training has previously been shown to be effective for training cross-lingual sentence classifiers. However, it is not clear if…

Computation and Language · Computer Science 2018-08-15 Heike Adel, Anton Bryl, David Weiss, Aliaksei Severyn

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from…

Machine Learning · Computer Science 2023-09-06 Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein

On the Hardness of Junking LLMs

Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction…

Machine Learning · Computer Science 2026-05-07 Marco Rando, Samuel Vaiter

Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations

The growth of highly advanced Large Language Models (LLMs) constitutes a huge dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial…

Cryptography and Security · Computer Science 2025-10-06 Lekkala Sai Teja, Annepaka Yadagiri, Sangam Sai Anish, Siva Gopala Krishna Nuthakki, Partha Pakray

Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models

Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including through the use of adversarial suffixes.…

Cryptography and Security · Computer Science 2024-12-20 Wei Zhao, Zhe Li, Yige Li, Jun Sun

Where is the signal in tokenization space?

Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a…

Computation and Language · Computer Science 2025-06-09 Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

Large Language Models (LLMs) have shown remarkable capabilities in language understanding and generation. Nonetheless, it was also witnessed that LLMs tend to produce inaccurate responses to specific queries. This deficiency can be traced…

Computation and Language · Computer Science 2025-05-16 Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Ziqin Luo, Guochao Jiang, Jiaqing Liang, Deqing Yang

Selective Adversarial Attacks on LLM Benchmarks

Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized…

Machine Learning · Computer Science 2025-10-16 Ivan Dubrovsky, Anastasia Orlova, Illarion Iov, Nina Gubina, Irena Gureeva, Alexey Zaytsev

Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge

The widespread adoption of code language models in software engineering tasks has exposed vulnerabilities to adversarial attacks, especially the identifier substitution attacks. Although existing identifier substitution attackers…

Software Engineering · Computer Science 2025-04-29 Wenhan Mu, Ling Xu, Shuren Pei, Le Mi, Huichi Zhou

Tokens for Learning, Tokens for Unlearning: Mitigating Membership Inference Attacks in Large Language Models via Dual-Purpose Training

Large language models (LLMs) have become the backbone of modern natural language processing but pose privacy concerns about leaking sensitive training data. Membership inference attacks (MIAs), which aim to infer whether a sample is…

Machine Learning · Computer Science 2025-06-03 Toan Tran, Ruixuan Liu, Li Xiong

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Large Language Models (LLMs) have seen widespread adoption across multiple domains, creating an urgent need for robust safety alignment mechanisms. However, robustness remains challenging due to jailbreak attacks that bypass alignment via…

Machine Learning · Computer Science 2026-05-04 Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan

Benign Adversarial Attack: Tricking Models for Goodness

In spite of the successful application in many fields, machine learning models today suffer from notorious problems like vulnerability to adversarial examples. Beyond falling into the cat-and-mouse game between adversarial attack and…

Artificial Intelligence · Computer Science 2022-07-06 Jitao Sang, Xian Zhao, Jiaming Zhang, Zhiyu Lin

"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in…

Artificial Intelligence · Computer Science 2023-06-30 Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh