Related papers: Fooling Explanations in Text Classifiers

Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can…

Machine Learning · Computer Science 2026-02-10 Bohan Wang, Zewen Liu, Lu Lin, Hui Liu, Li Xiong, Ming Jin, Wei Jin

Estimating the Adversarial Robustness of Attributions in Text with Transformers

Explanations are crucial parts of deep neural network (DNN) classifiers. In high stakes applications, faithful and robust explanations are important to understand and gain trust in DNN classifiers. However, recent work has shown that…

Machine Learning · Computer Science 2022-12-20 Adam Ivankay, Mattia Rigotti, Ivan Girardi, Chiara Marchiori, Pascal Frossard

Attack to Fool and Explain Deep Networks

Deep visual models are susceptible to adversarial perturbations to inputs. Although these signals are carefully crafted, they still appear noise-like patterns to humans. This observation has led to the argument that deep visual…

Computer Vision and Pattern Recognition · Computer Science 2021-06-22 Naveed Akhtar, Muhammad A. A. K. Jalwana, Mohammed Bennamoun, Ajmal Mian

A Differentiable Language Model Adversarial Attack on Text Classifiers

Robustness of huge Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve robustness of these models is an exploration of an adversarial…

Computation and Language · Computer Science 2021-07-26 Ivan Fursov, Alexey Zaytsev, Pavel Burnyshev, Ekaterina Dmitrieva, Nikita Klyuchnikov, Andrey Kravchenko, Ekaterina Artemova, Evgeny Burnaev

Robustness of Explanation Methods for NLP Models

Explanation methods have emerged as an important tool to highlight the features responsible for the predictions of neural networks. There is mounting evidence that many explanation methods are rather unreliable and susceptible to malicious…

Computation and Language · Computer Science 2022-06-27 Shriya Atmakuri, Tejas Chheda, Dinesh Kandula, Nishant Yadav, Taesung Lee, Hessel Tuinhof

FINER: Enhancing State-of-the-art Classifiers with Feature Attribution to Facilitate Security Analysis

Deep learning classifiers achieve state-of-the-art performance in various risk detection applications. They explore rich semantic representations and are supposed to automatically discover risk behaviors. However, due to the lack of…

Cryptography and Security · Computer Science 2025-05-15 Yiling He, Jian Lou, Zhan Qin, Kui Ren

Alert-ME: An Explainability-Driven Defense Against Adversarial Examples in Transformer-Based Text Classification

Transformer-based text classifiers such as BERT, RoBERTa, T5, and GPT have shown strong performance in natural language processing tasks but remain vulnerable to adversarial examples. These vulnerabilities raise significant security…

Computation and Language · Computer Science 2025-10-27 Bushra Sabir, Yansong Gao, Alsharif Abuadbba, M. Ali Babar

A Character-Level Approach to the Text Normalization Problem Based on a New Causal Encoder

Text normalization is a ubiquitous process that appears as the first step of many Natural Language Processing problems. However, previous Deep Learning approaches have suffered from so-called silly errors, which are undetectable on…

Computation and Language · Computer Science 2019-03-08 Adrián Javaloy Bornás, Ginés García Mateos

SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers

We introduce SelfExplain, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies…

Computation and Language · Computer Science 2021-09-09 Dheeraj Rajagopal, Vidhisha Balachandran, Eduard Hovy, Yulia Tsvetkov

Fooling the Textual Fooler via Randomizing Latent Representations

Despite outstanding performance in a variety of NLP tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Among these attacks, adversarial…

Computation and Language · Computer Science 2024-06-11 Duy C. Hoang, Quang H. Nguyen, Saurav Manchanda, MinLong Peng, Kok-Seng Wong, Khoa D. Doan

Towards Explainable NLP: A Generative Explanation Framework for Text Classification

Building explainable systems is a critical problem in the field of Natural Language Processing (NLP), since most machine learning models provide no explanations for the predictions. Existing approaches for explainable machine learning…

Computation and Language · Computer Science 2019-06-12 Hui Liu, Qingyu Yin, William Yang Wang

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness…

Computation and Language · Computer Science 2020-04-10 Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits

DeepFool: a simple and accurate method to fool deep neural networks

State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well sought, perturbations of the images. Despite the…

Machine Learning · Computer Science 2016-08-30 Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Pascal Frossard

Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training

Feature attribution methods highlight the important input tokens as explanations to model predictions, which have been widely applied to deep neural networks towards trustworthy AI. However, recent works show that explanations provided by…

Computation and Language · Computer Science 2024-01-01 Dongfang Li, Baotian Hu, Qingcai Chen, Shan He

Deep Text Classification Can be Fooled

In this paper, we present an effective method to craft text adversarial samples, revealing one important yet underestimated fact that DNN-based text classifiers are also prone to adversarial sample attack. Specifically, confronted with…

Cryptography and Security · Computer Science 2019-01-08 Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi

A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples

Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while imperceptible…

Machine Learning · Computer Science 2017-09-28 Beilun Wang, Ji Gao, Yanjun Qi

Universal Adversarial Perturbation for Text Classification

Given a state-of-the-art deep neural network text classifier, we show the existence of a universal and very small perturbation vector (in the embedding space) that causes natural text to be misclassified with high probability. Unlike images…

Computation and Language · Computer Science 2019-10-11 Hang Gao, Tim Oates

On the Transferability of Adversarial Attacks against Neural Text Classifier

Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we…

Machine Learning · Computer Science 2021-09-23 Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, Kai-wei Chang

Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models

Text classification is a very common task nowadays and there are many efficient methods and algorithms that we can employ to accomplish it. Transformers have revolutionized the field of deep learning, particularly in Natural Language…

Machine Learning · Computer Science 2024-12-31 Christos Petridis

Tailoring Adversarial Attacks on Deep Neural Networks for Targeted Class Manipulation Using DeepFool Algorithm

The susceptibility of deep neural networks (DNNs) to adversarial attacks undermines their reliability across numerous applications, underscoring the necessity for an in-depth exploration of these vulnerabilities and the formulation of…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 S. M. Fazle Rabby Labib, Joyanta Jyoti Mondal, Meem Arafat Manab, Xi Xiao, Sarfaraz Newaz