Related papers: Weight-sparse transformers have interpretable circ…

Supervised Dictionary Learning

It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of…

Computer Vision and Pattern Recognition · Computer Science · 2009-09-29 · Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, Andrew Zisserman

Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference

Large Language Models (LLMs) have shown impressive performance across a wide range of tasks. However, the size of LLMs is steadily increasing, hindering their application on computationally constrained environments. On the other hand,…

Machine Learning · Computer Science · 2024-12-23 · Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

MonoNet: Towards Interpretable Models by Learning Monotonic Features

Being able to interpret, or explain, the predictions made by a machine learning model is of fundamental importance. This is especially true when there is interest in deploying data-driven models to make high-stakes decisions, e.g. in…

Machine Learning · Computer Science · 2019-10-01 · An-phi Nguyen, María Rodríguez Martínez

Linguistic Interpretability of Transformer-based Language Models: a systematic review

Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little…

Computation and Language · Computer Science · 2025-04-14 · Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero

Rigorous Interpretation Is a Form of Evaluation

Current machine learning models are evaluated through behavioral snapshots, with benchmark accuracies, win rates and outcome-based metrics. Model explanations and evaluations, however, are fundamentally intertwined: understanding why a…

Computers and Society · Computer Science · 2026-05-08 · Isabelle Lee, Emmy Liu, Cathy Jiao, Brihi Joshi, Dani Yogatama, Fazl Barez, Michael Saxon

Improving Neuron-level Interpretability with White-box Language Models

Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this…

Computation and Language · Computer Science · 2025-02-28 · Hao Bai, Yi Ma

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different…

Machine Learning · Computer Science · 2026-05-15 · Xinting Huang, Michael Hahn

Human-in-the-Loop Interpretability Prior

We often desire our models to be interpretable as well as accurate. Prior work on optimizing models for interpretability has relied on easy-to-quantify proxies for interpretability, such as sparsity or the number of operations required. In…

Machine Learning · Statistics · 2018-11-01 · Isaac Lage, Andrew Slavin Ross, Been Kim, Samuel J. Gershman, Finale Doshi-Velez

Analyzing Transformers in Embedding Space

Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent…

Computation and Language · Computer Science · 2023-12-27 · Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant

Is Sparse Attention more Interpretable?

Sparse attention has been claimed to increase model interpretability under the assumption that it highlights influential inputs. Yet the attention distribution is typically over representations internal to the model rather than the inputs…

Computation and Language · Computer Science · 2021-06-09 · Clara Meister, Stefan Lazov, Isabelle Augenstein, Ryan Cotterell

Dissecting Jet-Tagger Through Mechanistic Interpretability

Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained…

High Energy Physics - Phenomenology · Physics · 2026-05-12 · Saurabh Rai, Sanmay Ganguly

Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use…

Artificial Intelligence · Computer Science · 2025-12-18 · Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

Optimal Explanations of Linear Models

When predictive models are used to support complex and important decisions, the ability to explain a model's reasoning can increase trust, expose hidden biases, and reduce vulnerability to adversarial attacks. However, attempts at…

Machine Learning · Computer Science · 2019-07-11 · Dimitris Bertsimas, Arthur Delarue, Patrick Jaillet, Sebastien Martin

Mechanistic Interpretability for AI Safety -- A Review

Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks…

Artificial Intelligence · Computer Science · 2024-08-27 · Leonard Bereska, Efstratios Gavves

ParseCaps: An Interpretable Parsing Capsule Network for Medical Image Diagnosis

Deep learning has excelled in medical image classification, but its clinical application is limited by poor interpretability. Capsule networks, known for encoding hierarchical relationships and spatial features, show potential in addressing…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-05 · Xinyu Geng, Jiaming Wang, Jun Xu

Meaningful Models: Utilizing Conceptual Structure to Improve Machine Learning Interpretability

The last decade has seen huge progress in the development of advanced machine learning models; however, those models are powerless unless human users can interpret them. Here we show how the mind's construction of concepts and meaning can…

Machine Learning · Statistics · 2016-07-04 · Nick Condry

Flexible Model Interpretability through Natural Language Model Editing

Model interpretability and model editing are crucial goals in the age of large language models. Interestingly, there exists a link between these two goals: if a method is able to systematically edit model behavior with regard to a human…

Computation and Language · Computer Science · 2023-11-21 · Karel D'Oosterlinck, Thomas Demeester, Chris Develder, Christopher Potts

Beyond Sparsity: Tree Regularization of Deep Models for Interpretability

The lack of interpretability remains a key barrier to the adoption of deep models in many applications. In this work, we explicitly regularize deep models so human users might step through the process behind their predictions in little…

Machine Learning · Statistics · 2017-11-17 · Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Finale Doshi-Velez

On the definition and importance of interpretability in scientific machine learning

Though neural networks trained on large datasets have been successfully used to describe and predict many physical phenomena, there is a sense among scientists that, unlike traditional scientific models comprising simple mathematical…

Machine Learning · Computer Science · 2026-04-23 · Conor Rowan, Alireza Doostan

Sparse Relational Reasoning with Object-Centric Representations

We investigate the composability of soft-rules learned by relational neural architectures when operating over object-centric (slot-based) representations, under a variety of sparsity-inducing constraints. We find that increasing sparsity,…

Machine Learning · Computer Science · 2022-07-18 · Alex F. Spies, Alessandra Russo, Murray Shanahan