Related papers: Weight-sparse transformers have interpretable circ…

Generating Interpretable Networks using Hypernetworks

An essential goal in mechanistic interpretability is to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the…

Machine Learning · Computer Science 2023-12-07 Isaac Liao, Ziming Liu, Max Tegmark

Discovering High Level Patterns from Simulation Traces

Large Language Models (LLMs) are unable to reliably reason about specific physical systems. Attempts to imbue LLMs with knowledge of the necessary physics concepts have shown great promise, but explainability and validation remain open…

Artificial Intelligence · Computer Science 2026-05-22 Sean Memery, Kartic Subr

Explainable Neural Networks with Guarantees: A Sparse Estimation Approach

Balancing predictive power and interpretability has long been a challenging research area, particularly in powerful yet complex models like neural networks, where nonlinearity obstructs direct interpretation. This paper introduces a novel…

Machine Learning · Computer Science 2025-02-20 Antoine Ledent, Peng Liu

Sparse is Enough in Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Machine Learning · Computer Science 2021-11-29 Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Disentangling Polysemantic Channels in Convolutional Neural Networks

Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs…

Computer Vision and Pattern Recognition · Computer Science 2025-04-18 Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Computation on Sparse Neural Networks: an Inspiration for Future Hardware

Neural network models are widely used in solving many challenging problems, such as computer vision, personalized recommendation, and natural language processing. Those models are very computationally intensive and reach the hardware limit…

Machine Learning · Computer Science 2020-04-28 Fei Sun, Minghai Qin, Tianyun Zhang, Liu Liu, Yen-Kuang Chen, Yuan Xie

Scaling and evaluating sparse autoencoders

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders…

Machine Learning · Computer Science 2024-06-07 Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu

Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers

Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying…

Artificial Intelligence · Computer Science 2026-04-17 Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer

The Contextual Lasso: Sparse Linear Models via Deep Neural Networks

Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible…

Machine Learning · Statistics 2024-01-03 Ryan Thompson, Amir Dezfouli, Robert Kohn

Techniques for Interpretable Machine Learning

Interpretable machine learning tackles the important problem that humans cannot understand the behaviors of complex machine learning models and how these models arrive at a particular decision. Although many approaches have been proposed, a…

Machine Learning · Computer Science 2019-05-21 Mengnan Du, Ninghao Liu, Xia Hu

Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformer is…

Machine Learning · Computer Science 2026-03-20 Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a…

Machine Learning · Computer Science 2023-12-05 Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

Open Problems in Mechanistic Interpretability

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater…

Machine Learning · Computer Science 2025-01-29 Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable…

Machine Learning · Computer Science 2025-05-19 Omer Sahin Tas, Royden Wagner

Understanding Empirical Unlearning with Combinatorial Interpretability

While many recent methods aim to unlearn or remove knowledge from pretrained models, seemingly erased knowledge often persists and can be recovered in various ways. Because large foundation models are far from interpretable, understanding…

Machine Learning · Computer Science 2026-02-24 Shingo Kodama, Niv Cohen, Micah Adler, Nir Shavit

Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth

Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations…

Computation and Language · Computer Science 2026-02-13 Michelle Yuan, Weiyi Sun, Amir H. Rezaeian, Jyotika Singh, Sandip Ghoshal, Yao-Ting Wang, Miguel Ballesteros, Yassine Benajiba

ML Interpretability: Simple Isn't Easy

The interpretability of ML models is important, but it is not clear what it amounts to. So far, most philosophers have discussed the lack of interpretability of black-box models such as neural networks, and methods such as explainable AI…

Machine Learning · Computer Science 2024-01-05 Tim Räz

A constraints-based approach to fully interpretable neural networks for detecting learner behaviors

The increasing use of complex machine learning models in education has led to concerns about their interpretability, which in turn has spurred interest in developing explainability techniques that are both faithful to the model's inner…

Machine Learning · Computer Science 2025-05-13 Juan D. Pinto, Luc Paquette

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for…

Artificial Intelligence · Computer Science 2025-10-14 Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao