Related papers: An Open Source Python Library for Anonymizing Sens…
Openly sharing data with sensitive attributes and privacy restrictions is a challenging task. In this document we present the implementation of pyCANON, a Python library and command line interface (CLI) to check and assess the level of…
Recently introduced privacy legislation has aimed to restrict and control the amount of personal data published by companies and shared to third parties. Much of this real data is not only sensitive requiring anonymization, but also…
In this document, we present a state of the art of anonymization techniques for classical tabular datasets. This article is geared towards a general public having some knowledge of mathematics and computer science, but with no need for…
Anonymization is a foundational principle of data privacy regulation, yet its practical application remains riddled with ambiguity and inconsistency. This paper introduces the concept of anonymity-washing -- the misrepresentation of the…
This work investigates the effectiveness of different pseudonymization techniques, ranging from rule-based substitutions to using pre-trained Large Language Models (LLMs), on a variety of datasets and models used for two widely used NLP…
In medical organizations large amount of personal data are collected and analyzed by the data miner or researcher, for further perusal. However, the data collected may contain sensitive information such as specific disease of a patient and…
Since its conception in 2006, differential privacy has emerged as the de-facto standard in data privacy, owing to its robust mathematical guarantees, generalised applicability and rich body of literature. Over the years, researchers have…
Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation,…
Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published…
Data protection algorithms are becoming increasingly important to support modern business needs for facilitating data sharing and data monetization. Anonymization is an important step before data sharing. Several organizations leverage on…
Social networks have become an essential meeting point for millions of individuals willing to publish and consume huge quantities of heterogeneous information. Some studies have shown that the data published in these platforms may contain…
Enormous amounts of data collected from social networks or other online platforms are being published for the sake of statistics, marketing, and research, among other objectives. The consequent privacy and data security concerns have…
User-driven privacy allows individuals to control whether and at what granularity their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such…
A firm seeks to analyze a dataset and to release the results. The dataset contains information about individual people, and the firm is subject to some regulation that forbids the release of the dataset itself. The regulation also imposes…
Numerous generalization techniques have been proposed for privacy preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data…
Mining health data can lead to faster medical decisions, improvement in the quality of treatment, disease prevention, reduced cost, and it drives innovative solutions within the healthcare sector. However, health data is highly sensitive…
Process mining techniques such as process discovery and conformance checking provide insights into actual processes by analyzing event data that are widely available in information systems. These data are very valuable, but often contain…
The recomputability and reproducibility of results from scientific software requires access to both the source code and all associated input and output data. However, the full collection of these resources often does not accompany the key…
We consider the privacy problem in data publishing: given a relation I containing sensitive information 'anonymize' it to obtain a view V such that, on one hand attackers cannot learn any sensitive information from V, and on the other hand…
OpenData movement around the globe is demanding more access to information which lies locked in public or private servers. As recently reported by a McKinsey publication, this data has significant economic value, yet its release has…