Related papers: Privately generating tabular data using language m…

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Privacy concerns have attracted increasing attention in data-driven products due to the tendency of machine learning models to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee,…

Computation and Language · Computer Science · 2023-07-19 · Xiang Yue, Huseyin A. Inan, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Robert Sim

Differentially Private Data Release over Multiple Tables

We study synthetic data release for answering multiple linear queries over a set of database tables in a differentially private way. Two special cases have been considered in the literature: how to release a synthetic dataset for answering…

Databases · Computer Science · 2023-06-28 · Badih Ghazi, Xiao Hu, Ravi Kumar, Pasin Manurangsi

Private prediction for large-scale synthetic text generation

We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential…

Machine Learning · Computer Science · 2024-10-10 · Kareem Amin, Alex Bie, Weiwei Kong, Alexey Kurakin, Natalia Ponomareva, Umar Syed, Andreas Terzis, Sergei Vassilvitskii

Differentially Private Tabular Data Synthesis using Large Language Models

Synthetic tabular data generation with differential privacy is a crucial problem to enable data sharing with formal privacy. Despite a rich history of methodological research and development, developing differentially private tabular data…

Machine Learning · Computer Science · 2024-06-05 · Toan V. Tran, Li Xiong

Private Set Generation with Discriminative Information

Differentially private data generation techniques have become a promising solution to the data privacy challenge -- it enables sharing of data while complying with rigorous privacy guarantees, which is essential for scientific progress in…

Cryptography and Security · Computer Science · 2022-11-09 · Dingfan Chen, Raouf Kerkouche, Mario Fritz

Differentially Private Language Models for Secure Data Sharing

To protect the privacy of individuals whose data is being shared, it is of high importance to develop methods allowing researchers and companies to release textual data while providing formal privacy guarantees to its originators. In the…

Machine Learning · Computer Science · 2022-10-27 · Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, Mrinmaya Sachan

Differentially Private Language Models Benefit from Public Pre-training

Language modeling is a keystone task in natural language processing. When training a language model on sensitive information, differential privacy (DP) allows us to quantify the degree to which our private data is protected. However,…

Machine Learning · Computer Science · 2020-10-27 · Gavin Kerrigan, Dylan Slack, Jens Tuyls

Generate synthetic samples from tabular data

Generating new samples from data sets can mitigate extra expensive operations, increased invasive procedures, and mitigate privacy issues. These novel samples that are statistically robust can be used as a temporary and intermediate…

Machine Learning · Computer Science · 2022-12-26 · David Banh, Alan Huang

Privacy-preserving data sharing via probabilistic modelling

Differential privacy allows quantifying privacy loss resulting from accessing sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this…

Machine Learning · Statistics · 2021-06-10 · Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy…

Machine Learning · Computer Science · 2025-04-30 · Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz

Tabular Data Synthesis with Differential Privacy: A Survey

Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. In real-world applications like FinTech and Smart Manufacturing, transactional data, often in tabular…

Cryptography and Security · Computer Science · 2024-11-07 · Mengmeng Yang, Chi-Hung Chi, Kwok-Yan Lam, Jie Feng, Taolin Guo, Wei Ni

Deep Generative Models, Synthetic Tabular Data, and Differential Privacy: An Overview and Synthesis

This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models, focusing on tabular datasets. We specifically outline the importance of synthetic data generation in the…

Machine Learning · Computer Science · 2023-08-29 · Conor Hassan, Robert Salomone, Kerrie Mengersen

A Learning Theory Approach to Non-Interactive Database Privacy

In this paper we demonstrate that, ignoring computational constraints, it is possible to privately release synthetic databases that are useful for large classes of queries -- much larger in size than the database itself. Specifically, we…

Data Structures and Algorithms · Computer Science · 2011-09-13 · Avrim Blum, Katrina Ligett, Aaron Roth

Generating tabular datasets under differential privacy

Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of…

Machine Learning · Computer Science · 2023-08-30 · Gianluca Truda

Differentially Private Synthetic High-dimensional Tabular Stream

While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic…

Cryptography and Security · Computer Science · 2024-09-04 · Girish Kumar, Thomas Strohmer, Roman Vershynin

Differentially Private Distributed Learning for Language Modeling Tasks

One of the big challenges in machine learning applications is that training data can be different from the real-world data faced by the algorithm. In language modeling, users' language (e.g. in private messaging) could change in a year and…

Computation and Language · Computer Science · 2018-03-07 · Vadim Popov, Mikhail Kudinov, Irina Piontkovskaya, Petr Vytovtov, Alex Nevidomsky

A Latent Class Modeling Approach for Generating Synthetic Data and Making Posterior Inferences from Differentially Private Counts

Several algorithms exist for creating differentially private counts from contingency tables, such as two-way or three-way marginal counts. The resulting noisy counts generally do not correspond to a coherent contingency table, so that some…

Methodology · Statistics · 2022-01-26 · Michelle Pistner Nixon, Andrés F. Barrientos, Jerome P. Reiter, Aleksandra Slavković

Harnessing large-language models to generate private synthetic text

Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information. An alternative approach, which this paper studies, is to use a sensitive dataset to…

Machine Learning · Computer Science · 2024-01-12 · Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

Selective Pre-training for Private Fine-tuning

Text prediction models, when used in applications like email clients or word processors, must protect user data privacy and adhere to model size constraints. These constraints are crucial to meet memory and inference time requirements, as…

Machine Learning · Computer Science · 2024-07-03 · Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz Religa, Jian Yin, Huishuai Zhang

Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data

Differentially private (DP) machine learning often relies on the availability of public data for tasks like privacy-utility trade-off estimation, hyperparameter tuning, and pretraining. While public data assumptions may be reasonable in…

Machine Learning · Computer Science · 2025-04-22 · Shlomi Hod, Lucas Rosenblatt, Julia Stoyanovich