Related papers: Weight-sparse transformers have interpretable circ…

200 papers

An essential goal in mechanistic interpretability is to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the…

Machine Learning · Computer Science 2023-12-07 Isaac Liao, Ziming Liu, Max Tegmark

Large Language Models (LLMs) are unable to reliably reason about specific physical systems. Attempts to imbue LLMs with knowledge of the necessary physics concepts have shown great promise, but explainability and validation remain open…

Artificial Intelligence · Computer Science 2026-05-22 Sean Memery, Kartic Subr

Balancing predictive power and interpretability has long been a challenging research area, particularly in powerful yet complex models like neural networks, where nonlinearity obstructs direct interpretation. This paper introduces a novel…

Machine Learning · Computer Science 2025-02-20 Antoine Ledent, Peng Liu

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs…

Computer Vision and Pattern Recognition · Computer Science 2025-04-18 Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Neural network models are widely used in solving many challenging problems, such as computer vision, personalized recommendation, and natural language processing. Those models are very computationally intensive and reach the hardware limit…

Machine Learning · Computer Science 2020-04-28 Fei Sun, Minghai Qin, Tianyun Zhang, Liu Liu, Yen-Kuang Chen, Yuan Xie

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders…

Machine Learning · Computer Science 2024-06-07 Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu

Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying…

Artificial Intelligence · Computer Science 2026-04-17 Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer

Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible…

Machine Learning · Statistics 2024-01-03 Ryan Thompson, Amir Dezfouli, Robert Kohn

Interpretable machine learning tackles the important problem that humans cannot understand the behaviors of complex machine learning models and how these models arrive at a particular decision. Although many approaches have been proposed, a…

Machine Learning · Computer Science 2019-05-21 Mengnan Du, Ninghao Liu, Xia Hu

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformer is…

Machine Learning · Computer Science 2026-03-20 Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a…

Machine Learning · Computer Science 2023-12-05 Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater…

Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable…

Machine Learning · Computer Science 2025-05-19 Omer Sahin Tas, Royden Wagner

While many recent methods aim to unlearn or remove knowledge from pretrained models, seemingly erased knowledge often persists and can be recovered in various ways. Because large foundation models are far from interpretable, understanding…

Machine Learning · Computer Science 2026-02-24 Shingo Kodama, Niv Cohen, Micah Adler, Nir Shavit

Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations…

Computation and Language · Computer Science 2026-02-13 Michelle Yuan, Weiyi Sun, Amir H. Rezaeian, Jyotika Singh, Sandip Ghoshal, Yao-Ting Wang, Miguel Ballesteros, Yassine Benajiba

The interpretability of ML models is important, but it is not clear what it amounts to. So far, most philosophers have discussed the lack of interpretability of black-box models such as neural networks, and methods such as explainable AI…

Machine Learning · Computer Science 2024-01-05 Tim Räz

The increasing use of complex machine learning models in education has led to concerns about their interpretability, which in turn has spurred interest in developing explainability techniques that are both faithful to the model's inner…

Machine Learning · Computer Science 2025-05-13 Juan D. Pinto, Luc Paquette

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for…

Artificial Intelligence · Computer Science 2025-10-14 Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao