Unsupervised Text Deidentification

John X. Morris; Justin T. Chiu; Ramin Zabih; Alexander M. Rush

Computation and Language 2022-10-24 v1

Abstract

Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition trained on human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally identifying information. The approach uses a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by k-anonymity-based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia biographies and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside the scope of common named-entity-based approaches.
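
The rank-based stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses a trained reidentification model, while the code below substitutes a toy word-overlap scorer. The names `overlap_score`, `rank_and_margin`, and `deidentify`, the greedy search, and the tie-handling convention (profiles that tie with the correct one count toward its rank) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

MASK = "<mask>"


def overlap_score(tokens: List[str], profile: str) -> int:
    """Toy stand-in for a reidentification model (assumption, not the paper's
    model): count unmasked tokens that also occur in the profile text."""
    words = set(profile.lower().split())
    return sum(1 for t in tokens if t != MASK and t.lower() in words)


def rank_and_margin(tokens: List[str], profiles: Dict[str, str],
                    true_id: str) -> Tuple[int, int]:
    """Return (rank, margin) for the correct profile.

    rank: number of profiles scoring at least as high as the correct one,
          so ties count against the adversary (assumed convention).
    margin: correct profile's score minus the best competing score.
    """
    scores = {pid: overlap_score(tokens, text) for pid, text in profiles.items()}
    true_score = scores[true_id]
    rank = sum(1 for s in scores.values() if s >= true_score)
    best_other = max((s for pid, s in scores.items() if pid != true_id), default=0)
    return rank, true_score - best_other


def deidentify(tokens: List[str], profiles: Dict[str, str],
               true_id: str, k: int) -> List[str]:
    """Greedily mask one token at a time until the correct profile's rank is
    at least k, or until nothing is left to mask."""
    tokens = list(tokens)
    while rank_and_margin(tokens, profiles, true_id)[0] < k:
        candidates = [i for i, t in enumerate(tokens) if t != MASK]
        if not candidates:
            break

        def after_masking(i: int) -> Tuple[int, int]:
            # Prefer the highest resulting rank, then the smallest margin.
            trial = tokens[:i] + [MASK] + tokens[i + 1:]
            rank, margin = rank_and_margin(trial, profiles, true_id)
            return (rank, -margin)

        tokens[max(candidates, key=after_masking)] = MASK
    return tokens


# Hypothetical three-profile collection and document, for illustration only.
profiles = {
    "ada": "ada lovelace english mathematician writer",
    "alan": "alan turing english mathematician logician",
    "grace": "grace hopper american computer scientist",
}
doc = "Lovelace was an English mathematician and writer".split()
redacted = deidentify(doc, profiles, true_id="ada", k=2)
print(" ".join(redacted))  # <mask> was an English mathematician and <mask>
```

In this toy run, two words are masked before the correct profile ties with another candidate. Neither masked word is guaranteed to be something a named entity tagger would flag, which loosely mirrors the qualitative point above.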

Keywords

biometric authentication; information retrieval; differential privacy

Cite

@article{arxiv.2210.11528,
  title  = {Unsupervised Text Deidentification},
  author = {John X. Morris and Justin T. Chiu and Ramin Zabih and Alexander M. Rush},
  journal= {arXiv preprint arXiv:2210.11528},
  year   = {2022}
}

Comments

Findings of EMNLP 2022
