Related papers: Unsupervised Text Deidentification

Towards De-identification of Legal Texts

In many countries, personal information that can be published or shared between organizations is regulated and, therefore, documents must undergo a process of de-identification to eliminate or obfuscate confidential data. Our work focuses…

Computation and Language · Computer Science 2019-10-10 Diego Garat, Dina Wonsever

Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step…

Computation and Language · Computer Science 2023-10-24 Anthi Papadopoulou, Pierre Lison, Mark Anderson, Lilja Øvrelid, Ildikó Pilán

Re-identification of De-identified Documents with Autoregressive Infilling

Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to…

Computation and Language · Computer Science 2025-05-20 Lucas Georges Gabriel Charpentier, Pierre Lison

Protecting De-identified Documents from Search-based Linkage Attacks

While de-identification models can help conceal the identity of the individuals mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way…

Computation and Language · Computer Science 2026-03-18 Pierre Lison, Mark Anderson

Stronger Re-identification Attacks through Reasoning and Aggregation

Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has…

Computation and Language · Computer Science 2025-10-13 Lucas Georges Gabriel Charpentier, Pierre Lison

RedactBuster: Entity Type Recognition from Redacted Documents

The widespread exchange of digital documents in various domains has resulted in abundant private information being shared. This proliferation necessitates redaction techniques to protect sensitive content and user privacy. While numerous…

Cryptography and Security · Computer Science 2024-04-22 Mirco Beltrame, Mauro Conti, Pierpaolo Guglielmin, Francesco Marchiori, Gabriele Orazi

Keep It Private: Unsupervised Privatization of Online Text

Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in…

Computation and Language · Computer Science 2024-05-17 Calvin Bao, Marine Carpuat

Anonymization of Documents for Law Enforcement with Machine Learning

The steadily increasing utilization of data-driven methods and approaches in areas that handle sensitive personal information such as in law enforcement mandates an ever increasing effort in these institutions to comply with data protection…

Artificial Intelligence · Computer Science 2025-01-14 Manuel Eberhardinger, Patrick Takenaka, Daniel Grießhaber, Johannes Maucher

A review of Recent Techniques for Person Re-Identification

Person re-identification (ReId), a crucial task in surveillance, involves matching individuals across different camera views. The advent of Deep Learning, especially supervised techniques like Convolutional Neural Networks and Attention…

Computer Vision and Pattern Recognition · Computer Science 2025-10-27 Andrea Asperti, Salvatore Fiorilla, Simone Nardi, Lorenzo Orsini

Truthful Text Sanitization Guided by Inference Attacks

Text sanitization aims to rewrite parts of a document to prevent disclosure of personal information. The central challenge of text sanitization is to strike a balance between privacy protection (avoiding the leakage of personal information)…

Computation and Language · Computer Science 2025-09-03 Ildikó Pilán, Benet Manzanares-Salor, David Sánchez, Pierre Lison

An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents

Unstructured textual data is at the heart of healthcare systems. For obvious privacy reasons, these documents are not accessible to researchers as long as they contain personally identifiable information. One way to share this data while…

Cryptography and Security · Computer Science 2022-11-03 Yakini Tchouka, Jean-François Couchot, David Laiymani

Bootstrapping Text Anonymization Models with Distant Supervision

We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be…

Computation and Language · Computer Science 2022-05-17 Anthi Papadopoulou, Pierre Lison, Lilja Øvrelid, Ildikó Pilán

De-identification of Privacy-related Entities in Job Postings

De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as…

Computation and Language · Computer Science 2021-05-25 Kristian Nørgaard Jensen, Mike Zhang, Barbara Plank

Towards Quantifying The Privacy Of Redacted Text

In this paper we propose the use of a k-anonymity-like approach for evaluating the privacy of redacted text. Given a piece of redacted text, we use a state-of-the-art transformer-based deep learning network to reconstruct the original text. This…

Machine Learning · Computer Science 2024-10-11 Vaibhav Gusain, Douglas Leith

Man vs the machine: The Struggle for Effective Text Anonymisation in the Age of Large Language Models

The collection and use of personal data are becoming more common in today's data-driven culture. While there are many advantages to this, including better decision-making and service delivery, it also poses significant ethical issues around…

Cryptography and Security · Computer Science 2023-03-23 Constantinos Patsakis, Nikolaos Lykousas

RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or…

Computation and Language · Computer Science 2026-02-16 Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye

Textwash -- automated open-source text anonymisation

The increased use of text data in social science research has benefited from easy-to-access data (e.g., Twitter). That trend comes at the cost of research requiring sensitive but hard-to-share data (e.g., interview data, police reports,…

Computation and Language · Computer Science 2022-08-30 Bennett Kleinberg, Toby Davies, Maximilian Mozes

Privacy Guarantees for De-identifying Text Transformations

Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and…

Computation and Language · Computer Science 2022-11-16 David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, Dietrich Klakow

Enhancing Clinical Models with Pseudo Data for De-identification

Many models are pretrained on redacted text for privacy reasons. Clinical foundation models are often trained on de-identified text, which uses special syntax (masked) text in place of protected health information. Even though these models…

Computation and Language · Computer Science 2025-06-18 Paul Landes, Aaron J Chaise, Tarak Nath Nandi, Ravi K Madduri

Statistical anonymity: Quantifying reidentification risks without reidentifying users

Data anonymization is an approach to privacy-preserving data release aimed at preventing participants' reidentification, and it is an important alternative to differential privacy in applications that cannot tolerate noisy data. Existing…

Data Structures and Algorithms · Computer Science 2022-01-31 Gecia Bravo-Hermsdorff, Robert Busa-Fekete, Lee M. Gunderson, Andrés Muñoz Medina, Umar Syed