Yash Maurya and Aman Priyanshu, Carnegie Mellon University
Modern synthetic data generation with privacy guarantees has become increasingly prevalent. The usual approach is to take real data, create synthetic versions that follow similar patterns, and ensure privacy through differential privacy mechanisms. But what happens when theoretical privacy guarantees meet real-world data? Even with conservative epsilon values (ε<10), document formatting and contextual patterns can create unexpected privacy challenges, especially when the underlying models, like most LLMs, are not transparent about their own training data.
We explore a case study in which synthetic financial data was generated using public SEC filings with differential privacy guarantees (ε<10), yet still exhibited concerning privacy leakage. These findings raise important questions: Does the privacy leakage stem from the training data, or did fine-tuning unravel existing privacy controls in the base model? How do we evaluate privacy when the model's training history isn't fully known? This talk examines these challenges and raises awareness of emerging privacy considerations when generating synthetic data with modern language models.

Yash Maurya is a Privacy Engineer who evaluates empirical guarantees of real-world privacy deployments, having designed privacy-preserving systems at Meta, PwC, BNY Mellon, and Samsung. An IAPP Westin Scholar with a Master's in Privacy Engineering from Carnegie Mellon University, his research on LLM unlearning and privacy frameworks has been presented at ICLR, SaTML, and SOUPS.

Aman Priyanshu is an incoming AI Researcher at Cisco focused on AI safety and privacy. With a Master's from CMU, his research on foundation model vulnerabilities and LLM security has attracted media coverage and led to invitations to OpenAI's Red Teaming Network. His work has earned recognition through the AAAI Undergraduate Consortium Scholar award.

author = {Yash Maurya and Aman Priyanshu},
title = {When Privacy Guarantees Meet {Pre-Trained} {LLMs}: A Case Study in Synthetic Data},
year = {2025},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = jun
}
