Synthetic Data, DPDPA and Tools
- Kasturi Murthy
- Mar 6
A Brief Overview of Synthetic Data
Synthetic data is information that is artificially generated by computer algorithms rather than collected from real-world events or individuals. While it contains no real personal records, it is designed to mimic the statistical patterns, distributions, and correlations of the original dataset.
In the context of machine learning, this is typically achieved using generative AI techniques. For example, instead of relying on a database of highly sensitive real patient records, you might train a Variational Autoencoder (VAE) or a diffusion model on the original data. The trained model can then generate new, synthetic ECG waveforms that exhibit realistic R-peaks and signal noise but do not belong to any real person. For a worked example, see my earlier blog post [1].
This allows developers and researchers to test algorithms, train neural networks, and run complex analytics without ever exposing actual human data.
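To make the idea concrete, here is a minimal sketch that uses the simplest possible generative model, a bivariate normal distribution, in place of a VAE or diffusion model. The "real" data (age and resting heart rate) is invented for illustration, and the function names are hypothetical. The point is only to show the fit-then-sample pattern: estimate the statistics of the real data, then draw new rows from the fitted model.

```python
import math
import random

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in rows) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in rows) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_gaussian_2d(params, n, rng):
    """Draw n new rows from the fitted bivariate normal distribution."""
    rho = params["rho"]
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mix z1 into the second coordinate to reproduce the correlation.
        y = params["my"] + params["sy"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out

rng = random.Random(0)

# Invented stand-in for a sensitive dataset: (age, resting heart rate).
real = []
for _ in range(2000):
    age = rng.gauss(50, 12)
    heart_rate = 90 - 0.4 * age + rng.gauss(0, 5)
    real.append((age, heart_rate))

real_params = fit_gaussian_2d(real)
synthetic = sample_gaussian_2d(real_params, 2000, rng)
synth_params = fit_gaussian_2d(synthetic)

print("real correlation:     ", round(real_params["rho"], 3))
print("synthetic correlation:", round(synth_params["rho"], 3))
```

The synthetic rows share the means and correlation of the real ones without copying any individual record. Deep generative models follow the same pattern, with a far more expressive model in the middle.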
Synthetic Data Under the Digital Personal Data Protection Act
India's Digital Personal Data Protection Act, 2023 (DPDPA) does not explicitly mention or legally define "synthetic data" in its text. However, the Act imposes rigorous restrictions on how real digital personal data is collected and processed, forcing organizations to rethink their data pipelines. Legal and privacy experts increasingly discuss synthetic data as a Privacy-Enhancing Technology (PET) that can support compliance.
The DPDPA's Core Mandates
1. Exemption from the Act's Scope: The DPDPA regulates "digital personal data", meaning data about an individual who is identifiable by or in relation to that data. Because properly generated synthetic data cannot be traced back to any individual, it is generally argued to fall outside that definition, although the Act itself does not say so. On that reading, organizations can process, share, and store synthetic datasets without triggering the DPDPA's consent, grievance redressal, or cross-border transfer obligations.
2. Achieving Data Minimization (Section 6): The DPDPA limits processing to the personal data necessary for a clearly stated purpose. If a development team only needs to test a new software feature or train a machine learning model, holding on to raw user data for that work runs against this principle. Using synthetic datasets allows teams to meet their technical requirements while keeping the footprint of real personal data as small as the law requires.
3. Bypassing Purpose Limitation Bottlenecks: Under the Act, data collected for one purpose (e.g., processing a transaction) cannot be repurposed for something else (e.g., training a predictive AI model) without going back to the user for explicit, unambiguous consent. By synthesizing the data early in the pipeline, developers can use the artificial dataset for secondary research and testing without repeatedly asking users for new permissions, provided the synthetic data cannot be linked back to them.
4. The Risk of Re-Identification: The primary legal risk when using synthetic data under the DPDPA is poor-quality generation. If a generative model (like a GAN or VAE) is overfitted to its training data, it might "memorize" and accidentally leak real outliers or personal identifiers. If a synthetic dataset can be linked back to a real person, it falls back under the DPDPA. Organizations face severe penalties, up to ₹250 crore, for failing to secure personal data. Continuous adversarial testing is therefore needed to check that synthetic data remains anonymous.
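One simple check for the memorization risk in point 4 is to measure how close each synthetic row sits to its nearest real row, sometimes called distance to closest record. The sketch below is a heuristic under stated assumptions, not a legal test: the rows are invented, the function names and the `threshold` value are hypothetical, and real data would need its columns scaled to comparable units before distances mean anything.

```python
import math

def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def flag_possible_leaks(synthetic_rows, real_rows, threshold):
    """Return indices of synthetic rows lying within `threshold` of a real row."""
    return [
        i
        for i, row in enumerate(synthetic_rows)
        if distance_to_closest_record(row, real_rows) <= threshold
    ]

# Invented example: the last real row is an outlier the model has memorized.
real_rows = [(34.0, 72.0), (51.0, 64.0), (97.0, 41.0)]
synthetic_rows = [(40.2, 70.1), (97.0, 41.0), (55.3, 60.8)]

leaks = flag_possible_leaks(synthetic_rows, real_rows, threshold=0.5)
print(leaks)  # [1]
```

Flagged rows are candidates for removal or for retraining the generator with stronger regularization. A clean result from this check alone does not prove anonymity; it only rules out the most obvious copies.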
GAN‑Based Synthetic Data Generation
Generative Adversarial Networks (GANs) play a central role in producing high‑fidelity synthetic data, particularly for complex and high‑dimensional datasets. Tools such as CTGAN [2], TGAN [3] and DoppelGANger [4] demonstrate how GAN architectures can be adapted to different data modalities.
CTGAN focuses on tabular data, using conditional GANs to capture non‑linear and multi‑modal relationships. It is designed to handle real‑world challenges such as imbalanced categorical variables and missing values, making it suitable for domains like finance, healthcare, and marketing where statistical fidelity is essential for machine learning tasks.
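The way CTGAN copes with imbalanced categories can be illustrated with the "training-by-sampling" idea described in the CTGAN paper: when choosing which category to condition on, values are sampled in proportion to the logarithm of their frequency rather than the raw frequency, so rare categories are seen more often during training. The sketch below shows only that sampling step in plain Python. It is not the library's code; the function name, the example column, and the `+ 1` inside the logarithm (used here so that a category seen once still gets nonzero weight) are my assumptions.

```python
import math
import random
from collections import Counter

def log_frequency_probs(values):
    """Turn category counts into sampling probabilities proportional to log(count + 1)."""
    counts = Counter(values)
    weights = {category: math.log(n + 1) for category, n in counts.items()}
    total = sum(weights.values())
    return {category: w / total for category, w in weights.items()}

# Invented, heavily imbalanced column: 1% of rows are "fraud".
column = ["approved"] * 990 + ["fraud"] * 10
raw_share = column.count("fraud") / len(column)

probs = log_frequency_probs(column)

# Sample the conditions a generator would be asked to produce during training.
rng = random.Random(0)
conditions = rng.choices(list(probs), weights=list(probs.values()), k=1000)

print("raw share of fraud:     ", raw_share)
print("sampling prob. of fraud:", round(probs["fraud"], 3))
print("fraud conditions drawn: ", conditions.count("fraud"))
```

With raw frequencies the rare class would be chosen about 1% of the time; with log-frequency weights it is chosen roughly a quarter of the time, which gives the model far more opportunities to learn it.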
DoppelGANger extends GANs to time‑series data, synthesizing realistic temporal sequences from financial and IoT datasets. By preserving dynamic dependencies over time, it enables use cases such as forecasting and anomaly detection without exposing sensitive source data.
TGAN targets high‑dimensional tabular datasets by combining GANs with recurrent neural networks. Its sequential modeling and conditional generation strategies help maintain realistic correlations across continuous and categorical features, supporting analytics and predictive modeling across industries including insurance, retail, and healthcare.
User‑Friendly and Domain‑Specific Tools
Not all synthetic data solutions require deep machine learning expertise. User‑friendly and domain‑specific tools make synthetic data accessible to a broader audience.
Synner [5] emphasizes ease of use, offering intuitive schema design and multi‑format exports such as CSV, JSON, and SQL. These features allow domain experts to rapidly generate datasets tailored to their needs and integrate them directly into existing data pipelines.
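The schema-then-export workflow can be sketched with the Python standard library. This is not Synner's interface, which is a visual tool; the schema format, column names, and helper functions below are hypothetical and only illustrate declaring columns once and exporting the same rows as CSV, JSON, and SQL.

```python
import csv
import io
import json
import random

# Hypothetical schema: column name -> function that draws one value from an RNG.
SCHEMA = {
    "customer_id": lambda rng: rng.randint(1000, 9999),
    "city": lambda rng: rng.choice(["Bengaluru", "Mumbai", "Delhi"]),
    "monthly_spend": lambda rng: round(rng.uniform(500, 5000), 2),
}

def generate_rows(schema, n, seed=0):
    """Generate n rows as dictionaries, one value per schema column."""
    rng = random.Random(seed)
    return [{col: gen(rng) for col, gen in schema.items()} for _ in range(n)]

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    return json.dumps(rows, indent=2)

def sql_literal(value):
    """Quote strings (doubling single quotes); leave numbers as they are."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)

def to_sql(rows, table):
    cols = ", ".join(rows[0])
    statements = []
    for row in rows:
        vals = ", ".join(sql_literal(v) for v in row.values())
        statements.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return "\n".join(statements)

rows = generate_rows(SCHEMA, 5)
csv_text = to_csv(rows)
json_text = to_json(rows)
sql_text = to_sql(rows, "customers")
print(sql_text)
```

Keeping the schema separate from the exporters is the design choice that matters here: a domain expert edits only the schema, and every output format stays in sync.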
In healthcare, Synthea [6] provides open‑source synthetic patient records that reflect realistic clinical workflows and longitudinal care journeys. Its adherence to standards like HL7 FHIR makes it well suited for research, public health modeling, and health IT testing, all while avoiding exposure of protected health information.
Statistical and Privacy‑Focused Solutions
For regulated environments, privacy and utility must be balanced carefully. Statistical and privacy‑focused tools address this challenge directly.
The SDV framework supports multiple modeling approaches, including GANs, variational autoencoders, and copula methods, to generate synthetic data that closely matches the statistical properties of real datasets. This flexibility has led to its use across sectors such as finance, healthcare, and telecom.
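Of the approaches listed, the copula method is the easiest to show from scratch. The sketch below is a two-column Gaussian copula written with the standard library only. It is not SDV's implementation, and the data and function names are invented. It also reuses each column's observed values as the marginal distribution, which keeps the code short but would not be acceptable for privacy; a production tool would fit a parametric distribution to each column instead.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(column):
    """Map each value to a standard-normal score based on its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def copula_sample(col_a, col_b, n_samples, rng):
    """Sample pairs whose dependence follows a Gaussian copula fitted to the data."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    n = len(col_a)
    out = []
    for _ in range(n_samples):
        # Correlated standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Normal score -> uniform -> empirical quantile of each column.
        ia = min(int(STD_NORMAL.cdf(z1) * n), n - 1)
        ib = min(int(STD_NORMAL.cdf(z2) * n), n - 1)
        out.append((sorted_a[ia], sorted_b[ib]))
    return out

rng = random.Random(1)

# Invented skewed data: income and a spend figure that depends on it.
income = [rng.lognormvariate(10, 0.5) for _ in range(1000)]
spend = [0.3 * x + rng.expovariate(1 / 2000) for x in income]

synthetic = copula_sample(income, spend, 1000, rng)
```

The copula separates two questions: what each column looks like on its own, and how the columns move together. That separation is why the method handles skewed, non-normal columns that a plain multivariate normal would distort.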
MirrorDataGenerator [7] and SmartNoise [8] highlight the growing emphasis on privacy guarantees. These tools incorporate techniques such as differential privacy and protections against inference attacks, enabling safer data sharing and regulatory compliance, particularly in government and healthcare contexts.
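Differential privacy is easiest to understand through its basic building block, the Laplace mechanism: add noise drawn from a Laplace distribution, scaled to how much one person can change the answer, before releasing a statistic. The sketch below applies it to a counting query. It does not use SmartNoise or any other library's API; the function names, the invented age data, and the `epsilon` value are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)

# Invented dataset: ages of 10,000 people.
ages = [rng.randint(18, 90) for _ in range(10_000)]

true_count = sum(1 for age in ages if age >= 65)
noisy_count = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)

print("true count: ", true_count)
print("noisy count:", round(noisy_count, 1))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means a more accurate answer and weaker privacy. Tools in this category manage that trade-off, and the cumulative privacy budget across many queries, on the user's behalf.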
Synthetic Data for Testing and Machine Learning
Beyond analytics, synthetic data is increasingly used for software testing and machine learning development. Plaitpy exemplifies this category by simulating complex, heterogeneous datasets with controlled noise and rare anomalies. Such data is valuable for benchmarking, pipeline testing, and improving model robustness when real labeled data is scarce or restricted.
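As a rough illustration of test data with controlled noise and rare, labeled anomalies, here is a standard-library sketch. It is not Plaitpy's template format or API; the sensor scenario, function name, and parameters are invented.

```python
import math
import random

def simulate_sensor(n, noise_sd, anomaly_rate, seed=0):
    """Generate n hourly readings with a daily cycle, Gaussian noise, and rare spikes."""
    rng = random.Random(seed)
    rows = []
    for t in range(n):
        # Baseline: 20 units with a 24-step cycle of amplitude 5, plus noise.
        value = 20 + 5 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, noise_sd)
        is_anomaly = rng.random() < anomaly_rate
        if is_anomaly:
            # Inject a large spike or drop so the anomaly is clearly separable.
            value += rng.choice([-1, 1]) * rng.uniform(15, 25)
        rows.append({"t": t, "value": round(value, 3), "is_anomaly": is_anomaly})
    return rows

rows = simulate_sensor(n=5000, noise_sd=0.5, anomaly_rate=0.01)
anomalies = [row for row in rows if row["is_anomaly"]]

print("rows:", len(rows), "anomalies:", len(anomalies))
```

Because every anomaly carries a ground-truth label, a dataset like this can be used to benchmark a detector or exercise a pipeline end to end, and the noise level and anomaly rate can be dialed up or down to probe robustness.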
Conclusion
The landscape of synthetic data tools is broad, ranging from advanced GAN-based models to privacy-centric statistical methods. These tools serve sectors such as healthcare, finance, and software testing, helping teams balance realism and utility against compliance with the Digital Personal Data Protection Act (DPDPA). As data-related challenges evolve, combining these methods will be important for promoting innovation while safeguarding sensitive information in line with regulatory standards.
References
[1] https://www.intuitus.co.in/post/lstm-vae-deep-architecture-for-gpu-accelerated-ecg-generation
[2] sdv-dev/CTGAN: Conditional GAN for generating synthetic tabular data. GitHub.
[4] https://docs.synthetic.ydata.ai/1.3/synthetic_data/time_series/doppelganger_example/
[5] huda-lab/synner: Generating Realistic Synthetic Data. GitHub.
