Related papers: Logit Pairing Methods Can Fool Gradient-Based Atta…

Evaluating and Understanding the Robustness of Adversarial Logit Pairing

We evaluate the robustness of Adversarial Logit Pairing, a recently proposed defense against adversarial examples. We find that a network trained with Adversarial Logit Pairing achieves 0.6% accuracy in the threat model in which the defense…

Machine Learning · Statistics 2018-11-27 Logan Engstrom, Andrew Ilyas, Anish Athalye

Improved Adversarial Robustness via Logit Regularization Methods

While great progress has been made at making neural networks effective across a wide range of visual tasks, most models are surprisingly vulnerable. This frailness takes the form of small, carefully chosen perturbations of their input,…

Machine Learning · Computer Science 2019-06-11 Cecilia Summers, Michael J. Dinneen

Improving Adversarial Robustness via Attention and Adversarial Logit Pairing

Though deep neural networks have achieved the state of the art performance in visual classification, recent studies have shown that they are all vulnerable to the attack of adversarial examples. In this paper, we develop improved techniques…

Machine Learning · Computer Science 2021-09-09 Dou Goodman, Xingjian Li, Ji Liu, Dejing Dou, Tao Wei

Adaptive Adversarial Logits Pairing

Adversarial examples provide an opportunity as well as impose a challenge for understanding image classification systems. Based on the analysis of the adversarial training solution Adversarial Logits Pairing (ALP), we observed in this work…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Shangxi Wu, Jitao Sang, Kaiyuan Xu, Guanhua Zheng, Changsheng Xu

Adversarial Logit Pairing

In this paper, we develop improved techniques for defending against adversarial examples at scale. First, we implement the state of the art version of adversarial training at unprecedented scale on ImageNet and investigate whether it…

Machine Learning · Computer Science 2018-03-20 Harini Kannan, Alexey Kurakin, Ian Goodfellow

Label Smoothing and Logit Squeezing: A Replacement for Adversarial Training?

Adversarial training is one of the strongest defenses against adversarial attacks, but it requires adversarial examples to be generated for every mini-batch during optimization. The expense of producing these examples during training often…

Machine Learning · Computer Science 2019-10-28 Ali Shafahi, Amin Ghiasi, Furong Huang, Tom Goldstein

Adversarial Attacks on Large Language Models Using Regularized Relaxation

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to…

Machine Learning · Computer Science 2024-10-28 Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

Advancing Adversarial Robustness Through Adversarial Logit Update

Deep Neural Networks are susceptible to adversarial perturbations. Adversarial training and adversarial purification are among the most widely recognized defense strategies. Although these methods have different underlying logic, both rely…

Machine Learning · Computer Science 2023-08-30 Hao Xuan, Peican Zhu, Xingyu Li

Attacking Large Language Models with Projected Gradient Descent

Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls.…

Machine Learning · Computer Science 2025-03-04 Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann

Do Perceptually Aligned Gradients Imply Adversarial Robustness?

Adversarially robust classifiers possess a trait that non-robust models do not -- Perceptually Aligned Gradients (PAG). Their gradients with respect to the input align well with human perception. Several works have identified PAG as a…

Computer Vision and Pattern Recognition · Computer Science 2023-08-10 Roy Ganz, Bahjat Kawar, Michael Elad

Adapting to Evolving Adversaries with Regularized Continual Robust Training

Robust training methods typically defend against specific attack types, such as Lp attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended…

Machine Learning · Computer Science 2025-02-07 Sihui Dai, Christian Cianfarani, Arjun Bhagoji, Vikash Sehwag, Prateek Mittal

Adversarial Contrastive Learning for LLM Quantization Attacks

Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed severe security risks that benign LLMs in full precision may exhibit malicious behaviors after…

Cryptography and Security · Computer Science 2026-01-07 Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li

Improving Adversarial Robustness by Putting More Regularizations on Less Robust Samples

Adversarial training, which is to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we…

Machine Learning · Statistics 2023-06-02 Dongyoon Yang, Insung Kong, Yongdai Kim

Accurate, reliable and fast robustness evaluation

Throughout the past five years, the susceptibility of neural networks to minimal adversarial perturbations has moved from a peculiar phenomenon to a core issue in Deep Learning. Despite much attention, however, progress towards more robust…

Machine Learning · Statistics 2019-12-13 Wieland Brendel, Jonas Rauber, Matthias Kümmerer, Ivan Ustyuzhaninov, Matthias Bethge

Algebraic Adversarial Attacks on Integrated Gradients

Adversarial attacks on explainability models have drastic consequences when explanations are used to understand the reasoning of neural networks in safety critical systems. Path methods are one such class of attribution methods susceptible…

Machine Learning · Computer Science 2025-02-28 Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew

Regularizers for Single-step Adversarial Training

The progress in the last decade has enabled machine learning models to achieve impressive performance across a wide range of tasks in Computer Vision. However, a plethora of works have demonstrated the susceptibility of these models to…

Machine Learning · Computer Science 2020-02-06 B. S. Vivek, R. Venkatesh Babu

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We…

Artificial Intelligence · Computer Science 2026-05-12 Linh Le, David Williams-King, Mohamed Amine Merzouk, Aton Kamanda, Adam Oberman

Explainable Adversarial Attacks on Coarse-to-Fine Classifiers

Traditional adversarial attacks typically aim to alter the predicted labels of input images by generating perturbations that are imperceptible to the human eye. However, these approaches often lack explainability. Moreover, most existing…

Computer Vision and Pattern Recognition · Computer Science 2025-04-17 Akram Heidarizadeh, Connor Hatfield, Lorenzo Lazzarotto, HanQin Cai, George Atia

Improving Adversarial Robustness with Self-Paced Hard-Class Pair Reweighting

Deep Neural Networks are vulnerable to adversarial attacks. Among many defense strategies, adversarial training with untargeted attacks is one of the most effective methods. Theoretically, adversarial perturbation in untargeted attacks can…

Computer Vision and Pattern Recognition · Computer Science 2022-12-01 Pengyue Hou, Jie Han, Xingyu Li

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to…

Machine Learning · Computer Science 2025-10-29 Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev