Related papers: Kernelized Concept Erasure

Linear Adversarial Concept Erasure

Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly being used in real-world applications, the inability to control their…

Machine Learning · Computer Science 2024-12-18 Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell

Nonlinear Concept Erasure: a Density Matching Approach

Ensuring that neural models used in real-world applications cannot infer sensitive information, such as demographic attributes like gender or race, from text representations is a critical challenge when fairness is a concern. We address…

Machine Learning · Computer Science 2025-08-19 Antoine Saillenfest, Pirmin Lemberger

Understanding Neural Networks through Representation Erasure

While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural…

Computation and Language · Computer Science 2017-01-11 Jiwei Li, Will Monroe, Dan Jurafsky

Log-linear Guardedness and its Implications

Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the…

Machine Learning · Computer Science 2024-05-14 Shauli Ravfogel, Yoav Goldberg, Ryan Cotterell

A Disentangling Invertible Interpretation Network for Explaining Latent Representations

Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance are black-box models whose hidden representations…

Computer Vision and Pattern Recognition · Computer Science 2020-04-29 Patrick Esser, Robin Rombach, Björn Ommer

Deep Concept Removal

We address the problem of concept removal in deep neural networks, aiming to learn representations that do not encode certain specified concepts (e.g., gender etc.) We propose a novel method based on adversarial linear classifiers trained…

Machine Learning · Computer Science 2023-10-10 Yegor Klochkov, Jean-Francois Ton, Ruocheng Guo, Yang Liu, Hang Li

TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes

Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations…

Computation and Language · Computer Science 2024-10-17 Fanny Jourdan, Louis Béthune, Agustin Picard, Laurent Risser, Nicholas Asher

From superposition to sparse codes: interpretable representations in neural networks

Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Despite their nonlinear architectures, recent evidence suggests that neural networks encode…

Machine Learning · Computer Science 2025-03-04 David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane

Concept Probing: Where to Find Human-Defined Concepts (Extended Version)

Concept probing has recently gained popularity as a way for humans to peek into what is encoded within artificial neural networks. In concept probing, additional classifiers are trained to map the internal representations of a model into…

Machine Learning · Computer Science 2025-07-28 Manuel de Sousa Ribeiro, Afonso Leote, João Leite

Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models

Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Yiwei Xie, Ping Liu, Zheng Zhang

A framework for analyzing concept representations in neural models

Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to…

Computation and Language · Computer Science 2026-05-05 Burin Naowarat, Hao Tang, Sharon Goldwater

Emergence of Concepts in DNNs?

The present paper reviews and discusses work from computer science that proposes to identify concepts in internal representations (hidden layers) of DNNs. It is examined, first, how existing methods actually identify concepts that are…

Machine Learning · Computer Science 2023-11-06 Tim Räz

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function…

Machine Learning · Computer Science 2025-11-13 Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel

Fundamental Limits of Perfect Concept Erasure

Concept erasure is the task of erasing information about a concept (e.g., gender or race) from a representation set while retaining the maximum possible utility -- information from original representations. Concept erasure is useful in…

Machine Learning · Computer Science 2025-03-27 Somnath Basu Roy Chowdhury, Avinava Dubey, Ahmad Beirami, Rahul Kidambi, Nicholas Monath, Amr Ahmed, Snigdha Chaturvedi

Interpreting Embedding Spaces by Conceptualization

One of the main methods for computational interpretation of a text is mapping it into a vector in some embedding space. Such vectors can then be used for a variety of textual processing tasks. Recently, most embedding spaces are a product…

Computation and Language · Computer Science 2023-11-10 Adi Simhi, Shaul Markovitch

Prototype-Guided Concept Erasure in Diffusion Models

Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Yuze Cai, Jiahao Lu, Hongxiang Shi, Yichao Zhou, Hong Lu

Interpretable Neural Embeddings with Sparse Self-Representation

Interpretability benefits the theoretical understanding of representations. Existing word embeddings are generally dense representations. Hence, the meaning of latent dimensions is difficult to interpret. This makes word embeddings like a…

Computation and Language · Computer Science 2023-06-27 Minxue Xia, Hao Zhu

Concept backpropagation: An Explainable AI approach for visualising learned concepts in neural network models

Neural network models are widely used in a variety of domains, often as black-box solutions, since they are not directly interpretable for humans. The field of explainable artificial intelligence aims at developing explanation methods to…

Machine Learning · Computer Science 2023-07-25 Patrik Hammersborg, Inga Strümke

Kernelized Classification in Deep Networks

We propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned…

Machine Learning · Computer Science 2021-03-22 Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar

Disentangling Neuron Representations with Concept Vectors

Mechanistic interpretability aims to understand how models store representations by breaking down neural networks into interpretable units. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated…

Computer Vision and Pattern Recognition · Computer Science 2023-04-20 Laura O'Mahony, Vincent Andrearczyk, Henning Muller, Mara Graziani