Related papers: Adversarial Training for Large Neural Language Mod…

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training…

Computation and Language · Computer Science · 2020-04-24 · Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, Jingjing Liu

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful…

Machine Learning · Computer Science · 2025-07-30 · Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

Algorithms for Adversarially Robust Deep Learning

Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent…

Machine Learning · Computer Science · 2025-09-24 · Alexander Robey

Impact of Adversarial Training on Robustness and Generalizability of Language Models

Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that achieving both robustness and generalization in adversarially trained models involves a…

Computation and Language · Computer Science · 2023-12-12 · Enes Altinisik, Hassan Sajjad, Husrev Taha Sencar, Safa Messaoud, Sanjay Chawla

How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?

The fine-tuning of pre-trained language models has a great success in many NLP fields. Yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a BERT-based sentiment…

Computation and Language · Computer Science · 2021-12-23 · Xinshuai Dong, Luu Anh Tuan, Min Lin, Shuicheng Yan, Hanwang Zhang

Adversarial Robustness through Dynamic Ensemble Learning

Adversarial attacks pose a significant threat to the reliability of pre-trained language models (PLMs) such as GPT, BERT, RoBERTa, and T5. This paper presents Adversarial Robustness through Dynamic Ensemble Learning (ARDEL), a novel scheme…

Cryptography and Security · Computer Science · 2025-05-14 · Hetvi Waghela, Jaydip Sen, Sneha Rakshit

Towards Improving Adversarial Training of NLP Models

Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence…

Computation and Language · Computer Science · 2021-09-14 · Jin Yong Yoo, Yanjun Qi

Improving Neural Language Modeling via Adversarial Training

Recently, substantial progress has been made in language modeling by using deep neural networks. However, in practice, large scale neural language models have been shown to be prone to overfitting. In this paper, we present a simple yet…

Machine Learning · Computer Science · 2019-09-10 · Dilin Wang, Chengyue Gong, Qiang Liu

How Does Adversarial Fine-Tuning Benefit BERT?

Adversarial training (AT) is one of the most reliable methods for defending against adversarial attacks in machine learning. Variants of this method have been used as regularization mechanisms to achieve SOTA results on NLP benchmarks, and…

Computation and Language · Computer Science · 2021-09-30 · Javid Ebrahimi, Hao Yang, Wei Zhang

Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation

Neural language models show vulnerability to adversarial examples which are semantically similar to their original counterparts with a few words replaced by their synonyms. A common way to improve model robustness is adversarial training…

Computation and Language · Computer Science · 2022-03-25 · Hanjie Chen, Yangfeng Ji

Adversarial Training with Contrastive Learning in NLP

For years, adversarial training has been extensively studied in natural language processing (NLP) settings. The main goal is to make models robust so that similar inputs derive in semantically similar outcomes, which is not a trivial…

Computation and Language · Computer Science · 2021-09-21 · Daniela N. Rim, DongNyeong Heo, Heeyoul Choi

Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

Adversarial training is a widely-applied approach to training deep neural networks to be robust against adversarial perturbation. However, although adversarial training has achieved empirical success in practice, it still remains unclear…

Machine Learning · Computer Science · 2025-02-10 · Binghui Li, Yuanzhi Li

On the Generalization Properties of Adversarial Training

Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Existing theoretical studies of adversarial training algorithms mostly focus on either adversarial training losses or…

Machine Learning · Statistics · 2021-04-07 · Yue Xing, Qifan Song, Guang Cheng

$\ell_\infty$-Robustness and Beyond: Unleashing Efficient Adversarial Training

Neural networks are vulnerable to adversarial attacks: adding well-crafted, imperceptible perturbations to their input can modify their output. Adversarial training is one of the most effective approaches in training robust models against…

Machine Learning · Computer Science · 2022-07-20 · Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie

Combining Adversaries with Anti-adversaries in Training

Adversarial training is an effective learning technique to improve the robustness of deep neural networks. In this study, the influence of adversarial training on deep learning models in terms of fairness, robustness, and generalization is…

Machine Learning · Computer Science · 2023-05-19 · Xiaoling Zhou, Nan Yang, Ou Wu

MAT: Mixed-Strategy Game of Adversarial Training in Fine-tuning

Fine-tuning large-scale pre-trained language models has been demonstrated effective for various natural language processing (NLP) tasks. Previous studies have established that incorporating adversarial training during the fine-tuning stage…

Computation and Language · Computer Science · 2023-06-29 · Zhehua Zhong, Tianyi Chen, Zhen Wang

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-05 · Lin Li, Haoyan Guan, Jianing Qiu, Michael Spratling

Improving Adversarial Robustness by Putting More Regularizations on Less Robust Samples

Adversarial training, which is to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we…

Machine Learning · Statistics · 2023-06-02 · Dongyoon Yang, Insung Kong, Yongdai Kim

Generative Adversarial Training Can Improve Neural Language Models

While deep learning in the form of recurrent neural networks (RNNs) has caused a significant improvement in neural language modeling, the fact that they are extremely prone to overfitting is still a mainly unresolved issue. In this paper we…

Computation and Language · Computer Science · 2022-11-18 · Sajad Movahedi, Azadeh Shakery

Robustness and Generalization via Generative Adversarial Training

While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness…

Computer Vision and Pattern Recognition · Computer Science · 2021-09-08 · Omid Poursaeed, Tianxing Jiang, Harry Yang, Serge Belongie, Ser-Nam Lim