Related papers: Adversarial Training for Large Neural Language Mod…

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets…

Machine Learning · Computer Science 2025-10-10 Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev

Robust Deep Reinforcement Learning with Adversarial Attacks

This paper proposes adversarial attacks for Reinforcement Learning (RL) and then improves the robustness of Deep Reinforcement Learning algorithms (DRL) to parameter uncertainties with the help of these attacks. We show that even a naively…

Machine Learning · Computer Science 2017-12-12 Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, Girish Chowdhary

Adversarial Training via Adaptive Knowledge Amalgamation of an Ensemble of Teachers

Adversarial training (AT) is a popular method for training robust deep neural networks (DNNs) against adversarial attacks. Yet, AT suffers from two shortcomings: (i) the robustness of DNNs trained by AT is highly intertwined with the size…

Machine Learning · Computer Science 2024-05-24 Shayan Mohajer Hamidi, Linfeng Ye

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can…

Machine Learning · Computer Science 2025-10-08 Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser

Revisiting the Robust Generalization of Adversarial Prompt Tuning

Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP against adversarial attacks is key to ensuring zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui

Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks

Adversarial training is one of the most effective methods for enhancing model robustness. Recent approaches incorporate adversarial distillation in adversarial training architectures. However, we notice two scenarios of defense methods that…

Machine Learning · Computer Science 2024-08-26 Zhenyu Liu, Haoran Duan, Huizhi Liang, Yang Long, Vaclav Snasel, Guiseppe Nicosia, Rajiv Ranjan, Varun Ojha

Advancing Adversarial Robustness Through Adversarial Logit Update

Deep Neural Networks are susceptible to adversarial perturbations. Adversarial training and adversarial purification are among the most widely recognized defense strategies. Although these methods have different underlying logic, both rely…

Machine Learning · Computer Science 2023-08-30 Hao Xuan, Peican Zhu, Xingyu Li

Adversarial Training for Process Reward Models

Process Reward Models (PRMs) enhance reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited due to expensive manual step-level annotation and poor generalization of static training data…

Machine Learning · Computer Science 2025-12-01 Gurusha Juneja, Deepak Nathani, William Yang Wang

Self-Progressing Robust Training

Enhancing model robustness under new and even adversarial environments is a crucial milestone toward building trustworthy machine learning systems. Current robust training methods such as adversarial training explicitly use an "attack"…

Machine Learning · Computer Science 2020-12-23 Minhao Cheng, Pin-Yu Chen, Sijia Liu, Shiyu Chang, Cho-Jui Hsieh, Payel Das

A General Retraining Framework for Scalable Adversarial Classification

Traditional classification algorithms assume that training and test data come from similar distributions. This assumption is violated in adversarial settings, where malicious actors modify instances to evade detection. A number of custom…

Computer Science and Game Theory · Computer Science 2016-11-29 Bo Li, Yevgeniy Vorobeychik, Xinyun Chen

Strengthening the Internal Adversarial Robustness in Lifted Neural Networks

Lifted neural networks (i.e. neural architectures explicitly optimizing over respective network potentials to determine the neural activities) can be combined with a type of adversarial training to gain robustness for internal as well as…

Machine Learning · Computer Science 2025-03-12 Christopher Zach

Learning to Learn from Mistakes: Robust Optimization for Adversarial Noise

Sensitivity to adversarial noise hinders deployment of machine learning algorithms in security-critical applications. Although many adversarial defenses have been proposed, robustness to adversarial noise remains an open problem. The most…

Machine Learning · Computer Science 2020-08-13 Alex Serban, Erik Poll, Joost Visser

Affine-Invariant Robust Training

The field of adversarial robustness has attracted significant attention in machine learning. Contrary to the common approach of training models that are accurate in the average case, it aims at training models that are accurate for worst case…

Machine Learning · Computer Science 2020-10-12 Oriol Barbany Mayor

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models

Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes the performance benchmarks across various natural language processing (NLP) tasks.…

Computation and Language · Computer Science 2024-04-10 Yue Xu, Wenjie Wang

MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models

Large pre-trained Vision Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Shiji Zhao, Qihui Zhu, Shukun Xiong, Shouwei Ruan, Maoxun Yuan, Jialing Tao, Jiexi Liu, Ranjie Duan, Jie Zhang, Jie Zhang, Xingxing Wei

Adversarial Robustness vs Model Compression, or Both?

It is well known that deep neural networks (DNNs) are vulnerable to adversarial attacks, which are implemented by adding crafted perturbations onto benign examples. Min-max robust optimization based adversarial training can provide a notion…

Computer Vision and Pattern Recognition · Computer Science 2021-06-23 Shaokai Ye, Kaidi Xu, Sijia Liu, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, Xue Lin

Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization

Deep neural networks are known to be vulnerable to well-designed adversarial attacks. The most successful defense technique based on adversarial training (AT) can achieve optimal robustness against particular attacks but cannot…

Computer Vision and Pattern Recognition · Computer Science 2024-08-26 Guang Lin, Chao Li, Jianhai Zhang, Toshihisa Tanaka, Qibin Zhao

Improved OOD Generalization via Adversarial Training and Pre-training

Recently, learning a model that generalizes well on out-of-distribution (OOD) data has attracted great attention in the machine learning community. In this paper, after defining OOD generalization via Wasserstein distance, we theoretically…

Machine Learning · Computer Science 2021-05-25 Mingyang Yi, Lu Hou, Jiacheng Sun, Lifeng Shang, Xin Jiang, Qun Liu, Zhi-Ming Ma

Learning to Defend by Learning to Attack

Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a bilevel optimization problem. The leader problem is trying to learn a…

Machine Learning · Computer Science 2021-05-04 Haoming Jiang, Zhehui Chen, Yuyang Shi, Bo Dai, Tuo Zhao

Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension

Reading comprehension models often overfit to nuances of training datasets and fail at adversarial evaluation. Training with an adversarially augmented dataset improves robustness against those adversarial attacks but hurts generalization of…

Computation and Language · Computer Science 2020-11-18 Adyasha Maharana, Mohit Bansal