Related papers: Adversarial Training for Large Neural Language Mod…

Improving Diversity with Adversarially Learned Transformations for Domain Generalization

To be successful in single source domain generalization, maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Many of the recent successes have come from methods that pre-specify the types of…

Machine Learning · Computer Science · 2022-12-14 · Tejas Gokhale, Rushil Anirudh, Jayaraman J. Thiagarajan, Bhavya Kailkhura, Chitta Baral, Yezhou Yang

Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning

Pretrained models from self-supervision are prevalently used in fine-tuning downstream tasks faster or for better accuracy. However, gaining robustness from pretraining is left unexplored. We introduce adversarial training into…

Computer Vision and Pattern Recognition · Computer Science · 2020-03-31 · Tianlong Chen, Sijia Liu, Shiyu Chang, Yu Cheng, Lisa Amini, Zhangyang Wang

Adversarial Attacks on Large Language Models Using Regularized Relaxation

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to…

Machine Learning · Computer Science · 2024-10-28 · Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

Improved Text Classification via Contrastive Adversarial Training

We propose a simple and general method to regularize the fine-tuning of Transformer-based encoders for text classification tasks. Specifically, during fine-tuning we generate adversarial examples by perturbing the word embeddings of the…

Computation and Language · Computer Science · 2022-02-21 · Lin Pan, Chung-Wei Hang, Avirup Sil, Saloni Potdar

FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training

Deep neural networks are susceptible to adversarial attacks and common corruptions, which undermine their robustness. In order to enhance model resilience against such challenges, Adversarial Training (AT) has emerged as a prominent…

Machine Learning · Computer Science · 2025-06-17 · Tejaswini Medi, Steffen Jung, Margret Keuper

Robust LLM safeguarding via refusal feature adversarial training

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of…

Machine Learning · Computer Science · 2025-03-21 · Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda

Defending Against Universal Perturbations With Shared Adversarial Training

Classifiers such as deep neural networks have been shown to be vulnerable against adversarial perturbations on problems with high-dimensional input space. While adversarial training improves the robustness of image classifiers against such…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-14 · Chaithanya Kumar Mummadi, Thomas Brox, Jan Hendrik Metzen

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that…

Machine Learning · Computer Science · 2026-04-15 · Shaopeng Fu, Di Wang

Modeling Adversarial Attack on Pre-trained Language Models as Sequential Decision Making

Pre-trained language models (PLMs) have been widely used to underpin various downstream tasks. However, the adversarial attack task has found that PLMs are vulnerable to small perturbations. Mainstream methods adopt a detached two-stage…

Computation and Language · Computer Science · 2023-05-30 · Xuanjie Fang, Sijie Cheng, Yang Liu, Wei Wang

Dynamic Adversarial Reinforcement Learning for Robust Multimodal Large Language Models

Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are…

Machine Learning · Computer Science · 2026-03-05 · Yicheng Bao, Xuhong Wang, Qiaosheng Zhang, Chaochao Lu, Xia Hu, Xin Tan

WARP: Word-level Adversarial ReProgramming

Transfer learning from pretrained language models recently became the dominant approach for solving many NLP tasks. A common approach to transfer learning for multiple tasks that maximize parameter sharing trains one or more task-specific…

Computation and Language · Computer Science · 2021-06-03 · Karen Hambardzumyan, Hrant Khachatrian, Jonathan May

Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning

Deep neural networks are susceptible to adversarial examples, posing a significant security risk in critical applications. Adversarial Training (AT) is a well-established technique to enhance adversarial robustness, but it often comes at…

Machine Learning · Computer Science · 2023-08-08 · Kaijie Zhu, Jindong Wang, Xixu Hu, Xing Xie, Ge Yang

Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on…

Machine Learning · Computer Science · 2026-05-14 · Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu

An Empirical Study of the Influence of Adversarial Fine-Tuning on Compressed Neural Networks

As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to…

Machine Learning · Computer Science · 2026-05-29 · Hallgrimur Thorsteinsson, Valdemar J Henriksen, Daniel I R Cruz, Raghavendra Selvan, Tong Chen

Hyper Adversarial Tuning for Boosting Adversarial Robustness of Pretrained Large Vision Models

Large vision models have been found vulnerable to adversarial examples, emphasizing the need for enhancing their adversarial robustness. While adversarial training is an effective defense for deep convolutional models, it often faces…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Kangtao Lv, Huangsen Cao, Kainan Tu, Yihuai Xu, Zhimeng Zhang, Xin Ding, Yongwei Wang

Toward Adversarial Training on Contextualized Language Representation

Beyond the success story of adversarial training (AT) in the recent text domain on top of pre-trained language models (PLMs), our empirical study showcases the inconsistent gains from AT on some tasks, e.g. commonsense reasoning, named…

Computation and Language · Computer Science · 2023-05-09 · Hongqiu Wu, Yongxiang Liu, Hanwen Shi, Hai Zhao, Min Zhang

On Norm-Agnostic Robustness of Adversarial Training

Adversarial examples are carefully perturbed inputs for fooling machine learning models. A well-acknowledged defense method against such examples is adversarial training, where adversarial examples are injected into training data to…

Machine Learning · Computer Science · 2019-05-17 · Bai Li, Changyou Chen, Wenlin Wang, Lawrence Carin

Visualizing and Understanding the Effectiveness of BERT

Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different…

Computation and Language · Computer Science · 2019-08-16 · Yaru Hao, Li Dong, Furu Wei, Ke Xu

Closing the Distribution Gap in Adversarial Training for LLMs

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting…

Machine Learning · Computer Science · 2026-02-19 · Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

Fast Training of Deep Neural Networks Robust to Adversarial Perturbations

Deep neural networks are capable of training fast and generalizing well within many domains. Despite their promising performance, deep networks have shown sensitivities to perturbations of their inputs (e.g., adversarial examples) and their…

Machine Learning · Computer Science · 2020-07-09 · Justin Goodwin, Olivia Brown, Victoria Helus