Related papers: Weight-sparse transformers have interpretable circ…

200 papers

Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Siyu Zhang

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language…

Machine Learning · Computer Science 2024-11-08 Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short…

Machine Learning · Computer Science 2023-11-01 Dan Friedman, Alexander Wettig, Danqi Chen

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational…

Computation and Language · Computer Science 2024-05-22 Charles O'Neill, Thang Bui

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of…

Machine Learning · Computer Science 2025-03-28 Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to…

Machine Learning · Computer Science 2025-02-03 Gonçalo Paulo, Nora Belrose

The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on…

Machine Learning · Computer Science 2026-03-05 Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

Word embeddings are a powerful natural language processing technique, but they are extremely difficult to interpret. To enable interpretable NLP models, we create vectors where each dimension is inherently interpretable. By inherently…

Computation and Language · Computer Science 2021-09-29 Adly Templeton

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse…

Computation and Language · Computer Science 2026-02-02 Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non-linear interactions and high-dimensional operations. While previous studies have…

Artificial Intelligence · Computer Science 2025-02-17 Lin Zhang, Lijie Hu, Di Wang

Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity…

Computer Vision and Pattern Recognition · Computer Science 2025-04-29 Hugo Porta, Emanuele Dalsasso, Diego Marcos, Devis Tuia

Interpretability benefits the theoretical understanding of representations. Existing word embeddings are generally dense representations. Hence, the meaning of latent dimensions is difficult to interpret. This makes word embeddings like a…

Computation and Language · Computer Science 2023-06-27 Minxue Xia, Hao Zhu

Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler…

Computation and Language · Computer Science 2025-10-14 Yisong Miao, Min-Yen Kan

For reliability, it is important that the predictions made by machine learning methods are interpretable by humans. In general, deep neural networks (DNNs) can provide accurate predictions, although it is difficult to interpret why such…

Machine Learning · Computer Science 2021-12-16 Yuya Yoshikawa, Tomoharu Iwata

Previous research on word embeddings has shown that sparse representations, which can be either learned on top of existing dense embeddings or obtained through model constraints during training time, have the benefit of increased…

Computation and Language · Computer Science 2018-09-26 Valentin Trifonov, Octavian-Eugen Ganea, Anna Potapenko, Thomas Hofmann

When quantitative models are used to support decision-making on complex and important topics, understanding a model's "reasoning" can increase trust in its predictions, expose hidden biases, or reduce vulnerability to adversarial attacks.…

Machine Learning · Computer Science 2019-07-09 Dimitris Bertsimas, Arthur Delarue, Patrick Jaillet, Sebastien Martin

Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of…

Machine Learning · Computer Science 2024-05-28 Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Roland S. Zimmermann, Thomas Klein, Wieland Brendel

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet…

Machine Learning · Computer Science 2017-03-07 Zachary C. Lipton

We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the…

Artificial Intelligence · Computer Science 2021-09-21 David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé, Hanna Wallach, Jennifer Wortman Vaughan