Related papers: Using Interpretation Methods for Model Enhancement

Evaluating Neuron Interpretation Methods of NLP Models

Neuron interpretation has gained traction in the field of interpretability and has provided fine-grained insights into what a model learns and how language knowledge is distributed amongst its different components. However, the lack of…

Computation and Language · Computer Science · 2023-11-07 · Yimin Fan, Fahim Dalvi, Nadir Durrani, Hassan Sajjad

Model Interpretability and Rationale Extraction by Input Mask Optimization

Concurrent with the rapid progress in the development of neural-network-based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen…

Computation and Language · Computer Science · 2025-08-18 · Marc Brinner, Sina Zarriess

Interpreting Deep Learning Models in Natural Language Processing: A Review

Neural network models have achieved state-of-the-art performance in a wide range of natural language processing (NLP) tasks. However, a long-standing criticism against neural network models is the lack of interpretability, which not only…

Computation and Language · Computer Science · 2021-10-26 · Xiaofei Sun, Diyi Yang, Xiaoya Li, Tianwei Zhang, Yuxian Meng, Han Qiu, Guoyin Wang, Eduard Hovy, Jiwei Li

A Framework to Learn with Interpretation

To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive…

Machine Learning · Computer Science · 2022-02-24 · Jayneel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc

FICNN: A Framework for the Interpretation of Deep Convolutional Neural Networks

With the continued development of Convolutional Neural Networks (CNNs), there is a growing concern regarding the representations that they encode internally. Analyzing these internal representations is referred to as model interpretation. While…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-18 · Hamed Behzadi-Khormouji, José Oramas

Towards Interpretable Natural Language Understanding with Explanations as Latent Variables

Recently, generating natural language explanations has shown very promising results, not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches…

Computation and Language · Computer Science · 2022-05-30 · Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong, Jian Tang

A survey on improving NLP models with human explanations

Training a model with access to human explanations can improve data efficiency and model performance on in- and out-of-domain data. Adding to these empirical findings, similarity with the process of human learning makes learning from…

Computation and Language · Computer Science · 2022-04-20 · Mareike Hartmann, Daniel Sonntag

Interpretation of Prediction Models Using the Input Gradient

State-of-the-art machine learning algorithms are highly optimized to provide the best possible prediction, naturally resulting in complex models. While these models often outperform simpler, more interpretable models by order of…

Machine Learning · Statistics · 2016-11-24 · Yotam Hechtlinger

Interpretability via Model Extraction

The ability to interpret machine learning models has become increasingly important now that machine learning is used to inform consequential decisions. We propose an approach called model extraction for interpreting complex, blackbox…

Machine Learning · Computer Science · 2018-03-14 · Osbert Bastani, Carolyn Kim, Hamsa Bastani

Fooling Neural Network Interpretations via Adversarial Model Manipulation

We ask whether neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the…

Machine Learning · Computer Science · 2019-11-04 · Juyeon Heo, Sunghwan Joo, Taesup Moon

Deep Model Interpretation with Limited Data: A Coreset-based Approach

Model Interpretation aims at the extraction of insights from the internals of a trained model. A common approach to address this task is the characterization of relevant features internally encoded in the model that are critical for its…

Machine Learning · Computer Science · 2024-10-07 · Hamed Behzadi-Khormouji, José Oramas

Combining Transformers with Natural Language Explanations

Many NLP applications require models to be interpretable. However, many successful neural architectures, including transformers, still lack effective interpretation methods. A possible solution could rely on building explanations from…

Computation and Language · Computer Science · 2024-04-04 · Federico Ruggeri, Marco Lippi, Paolo Torroni

AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

Neural NLP models are increasingly accurate but are imperfect and opaque: they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for…

Computation and Language · Computer Science · 2019-09-23 · Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh

Rigorous Interpretation Is a Form of Evaluation

Current machine learning models are evaluated through behavioral snapshots, with benchmark accuracies, win rates and outcome-based metrics. Model explanations and evaluations, however, are fundamentally intertwined: understanding why a…

Computers and Society · Computer Science · 2026-05-08 · Isabelle Lee, Emmy Liu, Cathy Jiao, Brihi Joshi, Dani Yogatama, Fazl Barez, Michael Saxon

Feature-Based Interpretable Surrogates for Optimization

For optimization models to be used in practice, it is crucial that users trust the results. A key factor in this aspect is the interpretability of the solution process. A previous framework for inherently interpretable optimization models…

Optimization and Control · Mathematics · 2026-02-13 · Marc Goerigk, Michael Hartisch, Sebastian Merten, Kartikey Sharma

Augmenting Interpretable Models with LLMs during Training

Recent large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their proliferation into high-stakes domains (e.g. medicine) and compute-limited settings has created a…

Artificial Intelligence · Computer Science · 2023-12-05 · Chandan Singh, Armin Askari, Rich Caruana, Jianfeng Gao

From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence

We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the…

Artificial Intelligence · Computer Science · 2021-09-21 · David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé, Hanna Wallach, Jennifer Wortman Vaughan

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language…

Computation and Language · Computer Science · 2023-12-11 · Simon Lermen, Ondřej Kvapil

Interpretable machine learning: definitions, methods, and applications

Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned…

Machine Learning · Statistics · 2019-11-15 · W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, Bin Yu

Meaningful Models: Utilizing Conceptual Structure to Improve Machine Learning Interpretability

The last decade has seen huge progress in the development of advanced machine learning models; however, those models are powerless unless human users can interpret them. Here we show how the mind's construction of concepts and meaning can…

Machine Learning · Statistics · 2016-07-04 · Nick Condry