Nassima Bouzid, Capital One
LLM-based simulators are promising new tools for generating synthetic data, especially under conditions that are challenging for traditional Differential Privacy (DP) methods (e.g., high-dimensional tabular data). By condensing customer attributes and behaviors into compact user profiles under DP, we can seed an LLM with those profiles to generate realistic customer transactions. We tested this "Profile-then-Simulate" approach on financial transaction data using PersonaLedger, an LLM-based generator, and compared it to direct DP synthesis on the same dataset.
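As a rough illustration of the "profile" step, the sketch below privatizes a single categorical attribute with the Laplace mechanism and renders the result as prompt text. It is a minimal sketch under stated assumptions, not the pipeline from the talk: the function names (`dp_profile`, `profile_to_prompt`), the category names, the prompt wording, and the choice of the Laplace mechanism are all hypothetical, and nothing here reflects PersonaLedger's actual interface. It assumes each customer contributes exactly one record and that the category domain is fixed in advance rather than read from the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_profile(categories, domain, epsilon, rng):
    """Return noisy category shares for one attribute (hypothetical helper).

    Assumes one record per customer, so adding or removing a customer
    changes one count by 1 (L1 sensitivity 1). Laplace noise with scale
    1/epsilon on each count then gives epsilon-DP for this histogram.
    `domain` must be chosen without looking at the data.
    """
    counts = Counter(categories)
    scale = 1.0 / epsilon
    # Clamping and normalizing are post-processing and do not affect DP.
    noisy = {c: max(0.0, counts.get(c, 0) + laplace_noise(scale, rng))
             for c in domain}
    total = sum(noisy.values())
    if total == 0.0:
        return {c: 1.0 / len(domain) for c in domain}
    return {c: v / total for c, v in noisy.items()}


def profile_to_prompt(profile):
    # Hypothetical prompt wording; a real generator defines its own format.
    lines = [f"- {c}: {share:.0%} of transactions"
             for c, share in sorted(profile.items())]
    return ("Simulate transactions matching these category shares:\n"
            + "\n".join(lines))


if __name__ == "__main__":
    rng = random.Random(0)
    domain = ["dining", "grocery", "travel"]  # toy categories
    data = ["grocery"] * 60 + ["travel"] * 30 + ["dining"] * 10  # toy data
    print(profile_to_prompt(dp_profile(data, domain, epsilon=1.0, rng=rng)))
```

Only the noisy shares reach the prompt in this sketch, so whatever the LLM does afterwards is post-processing of a DP output. A real profile would cover many attributes and would need the privacy budget split across them.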
We found that the LLM-based approach produces usable synthetic data, but direct synthesis still significantly outperforms it on both fraud detection utility and distributional fidelity. We identified systematic LLM biases, not DP noise, as the dominant source of error. The model's learned priors about "typical" financial behavior consistently overrode the statistical distributions we provided as input, particularly for demographic and categorical features, so the generated data diverged from the input distributions.
This talk shares practical lessons for privacy engineers considering generative AI for synthetic data: (1) LLM biases may dominate DP noise as the primary source of distributional error; (2) direct DP synthesis remains competitive for tractable datasets; and (3) rigorous fidelity evaluation is essential before deploying LLM-generated synthetic data in production pipelines.
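One simple way to make the fidelity check in lesson (3) concrete for a categorical feature is total variation distance, measured separately at each stage so that DP noise and LLM drift can be compared. The sketch below is an assumption about how such a check could look, not the evaluation used in the talk; `error_by_stage` and its stage labels are hypothetical names, and the shares in the usage example are toy values, not results.

```python
from collections import Counter


def category_shares(values):
    """Empirical share of each category in a list of labels."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions.

    Categories missing from one side are treated as having share 0.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def error_by_stage(real, profile, generated):
    """Split distributional error into two stages (hypothetical helper).

    real      -> shares in the source data
    profile   -> DP-noised shares given to the LLM as input
    generated -> shares observed in the LLM's output
    """
    return {
        "dp_noise": total_variation(real, profile),
        "llm_drift": total_variation(profile, generated),
        "end_to_end": total_variation(real, generated),
    }


if __name__ == "__main__":
    # Toy shares for illustration only.
    real = {"grocery": 0.50, "travel": 0.30, "dining": 0.20}
    profile = {"grocery": 0.52, "travel": 0.29, "dining": 0.19}
    generated = {"grocery": 0.70, "travel": 0.10, "dining": 0.20}
    for stage, tvd in error_by_stage(real, profile, generated).items():
        print(f"{stage}: {tvd:.3f}")
```

Comparing against `real` reads the private data directly, so these diagnostics belong in internal evaluation rather than in anything released alongside the synthetic dataset.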
Coauthors: Dehao Yuan, Nam H. Nguyen, Mayana Pereira

Nassima Bouzid is a Senior Machine Learning Engineer at Capital One, where she focuses on differential privacy and privacy-enhancing technologies. She holds a PhD in evolutionary biology from the University of Washington, where she studied diversification and environmental adaptation of lizards in Yosemite National Park. From genetic testing to insurance operations to fintech, she's consistently drawn to problems without established playbooks. Her work sits at the intersection of applied research and engineering, translating ambiguous, cross-domain problems into concrete, measurable solutions.

author = {Nassima Bouzid},
title = {{Profile-Then-Simulate}: Can {LLMs} Faithfully Generate Differentially Private Synthetic Data?},
year = {2026},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = jun
}
