Yucheng Fu, University of Virginia; Tianyao Gu and Elaine Shi, Carnegie Mellon University; Tianhao Wang, University of Virginia
Differentially private synthetic data generation has emerged as a powerful tool for sharing data while protecting individuals' privacy. However, when the attributes of sensitive data are distributed across multiple entities such as hospitals, companies, or government agencies, accurately generating synthetic data becomes challenging. In particular, it is difficult to capture informative statistical correlations and use them to guide data synthesis without gathering the entire private dataset. In response to this challenge, we propose a secure multi-party computation protocol for differentially private tabular data synthesis in the distributed setting. Our protocol contains two new primitives. The first is a protocol that exploits distributed point functions to efficiently estimate two-way marginals (pairwise joint distributions of attributes) across vertically distributed data. The second is a protocol for generating noise via batched lookups in the cumulative distribution function table. As a concrete demonstration, we build a distributed version of AIM, a state-of-the-art DP data-synthesis algorithm. Our implementation achieves the same utility as its centralized version while reducing end-to-end runtime by orders of magnitude compared with prior work. For example, we can synthesize the "Adult" dataset in 24 minutes in a real-world WAN setting, whereas the existing protocol is estimated to take 57 days.
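To make the two building blocks in the abstract concrete, the following is a minimal plaintext sketch, not the paper's protocol. It computes a two-way marginal over two attribute columns and perturbs each cell with noise drawn by a lookup in a precomputed cumulative distribution function table. The sketch makes several assumptions: it runs in the clear with no secure computation or distributed point functions, it uses a truncated discrete Laplace distribution, and the `scale`, `bound`, attribute names, and domains are illustrative choices. It carries no formal privacy guarantee.

```python
import bisect
import math
import random
from collections import Counter
from itertools import product


def two_way_marginal(col_a, col_b):
    """Count how often each (a, b) value pair occurs across aligned records."""
    return Counter(zip(col_a, col_b))


def build_cdf_table(scale, bound):
    """Tabulate the CDF of a discrete Laplace distribution truncated to
    [-bound, bound]. Distribution and truncation are assumptions of this sketch."""
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-abs(k) / scale) for k in support]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    cdf[-1] = 1.0  # guard against floating-point drift in the last entry
    return support, cdf


def sample_noise(support, cdf, rng):
    """Inverse-CDF sampling: draw u in [0, 1) and look up the first table
    entry whose cumulative probability is at least u."""
    u = rng.random()
    return support[bisect.bisect_left(cdf, u)]


# Hypothetical toy data: in the distributed setting each column would be
# held by a different party; here both are in the clear.
AGE_DOMAIN = ["young", "old"]
INCOME_DOMAIN = ["low", "high"]
ages = ["young", "old", "young", "old"]
incomes = ["low", "high", "low", "low"]

rng = random.Random(0)
marginal = two_way_marginal(ages, incomes)
support, cdf = build_cdf_table(scale=2.0, bound=20)

# Add noise to every cell of the (assumed public) domain, including empty ones.
noisy_marginal = {
    cell: marginal[cell] + sample_noise(support, cdf, rng)
    for cell in product(AGE_DOMAIN, INCOME_DOMAIN)
}
```

Tabulating the CDF once and sampling by lookup is what makes batching natural: many noise values can be drawn against the same table. The paper's contribution lies in doing both steps securely across parties, which this sketch does not attempt.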
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.