Synthetic Data: Medical research without GDPR headaches?

Developing new methods for early Alzheimer’s detection or innovative treatments for various cancers requires vast amounts of data. Yet strict data protection regulations often prevent researchers from using even the datasets that are already available. Many scientists avoid working with patient data altogether, fearing legal risks. A promising solution lies in synthetic data generated by artificial intelligence. In Europe, the SYNTHIA project is leading the way.

The double-edged sword of GDPR

GDPR protects our personal information. At the same time, it can severely restrict access to data that is essential for lifesaving research. Processing health data is often so complex and burdensome that valuable datasets remain untouched.

Take drug development as an example. New compounds must first be tested in laboratories, then on animals, and finally on humans. This process is designed to ensure safety and effectiveness, but it is slow: researchers need to recruit participants, collect and analyse data, and secure regulatory approvals. On average, the whole process takes 12 years and may cost up to 5 billion dollars. Because of these high barriers, pharmaceutical companies often focus only on the most commercially promising treatments.

Now imagine this entire research process taking place virtually. Using AI-generated digital twins instead of real patients, researchers could run multiple virtual trials at once, reducing both time and cost. Personalised therapies could become more accessible. Smaller research teams could enter the field. Innovation would no longer depend solely on large pharmaceutical companies.

Artificial data, real possibilities

Whenever a tool like ChatGPT creates content, it generates synthetic information. This data looks realistic but does not come from real individuals. Synthetic data offers two major advantages. It can be produced in large volumes, quickly and affordably. And because it does not describe real individuals, it is generally not subject to privacy regulations, provided no real person can be re-identified from it. No consent forms, no anonymisation, and no complex approvals are needed.

This is especially valuable in healthcare. Patient data that includes clinical outcomes, demographics, genetics and treatment histories is incredibly useful for research. But anonymising this data takes time and still does not fully remove legal limitations.

Several techniques are used to create synthetic data, including statistical models, rule-based systems and neural networks. In healthcare, Generative Adversarial Networks (GANs) are especially promising. A GAN trains two neural networks against each other: one generates artificial records, and the other tries to tell them apart from real ones. The result can be data that reflects the complexity of real-world datasets. Hybrid models that combine synthetic and real data are also useful, especially in cases involving rare diseases or personalised medicine.
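
To make the simplest of these techniques concrete, here is a minimal sketch of a statistical model. It is not a GAN and not a method attributed to any project mentioned in this article. It fits a two-variable Gaussian to a handful of made-up records and then samples new ones. The function names, the toy values and the choice of a Gaussian are all assumptions for illustration.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, using the same n - 1 denominator as stdev().
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows from the fitted model; no real row is copied."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Made-up toy records (hypothetical): age in years, systolic pressure in mmHg.
real = [(34, 118), (45, 125), (52, 131), (61, 140),
        (70, 148), (39, 121), (58, 135), (66, 143)]

model = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(model, 1000)
```

The synthetic rows follow roughly the same averages and the same relationship between age and blood pressure as the toy records, yet each one is drawn from the model and not copied from an original row. Real patient data is rarely this well behaved, which is one reason more expressive generators such as GANs attract interest.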

Why healthcare needs synthetic data

Synthetic data is not only useful for drug development. It also supports disease diagnosis, algorithm training and virtual clinical trials. It can help simulate public health scenarios and model epidemics.

Despite these benefits, adoption remains limited. One reason is the lack of clear standards. Synthetic data cannot be evaluated with the same quality metrics used for real data. Another issue is that generating reliable synthetic datasets requires deep expertise in artificial intelligence. If the data is not accurate, it can lead to flawed research results. In medicine, that risk is unacceptable.
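
One kind of check that is specific to synthetic data compares it with the real data it is meant to mimic. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distributions of a single column. This is only one possible fidelity measure, chosen here for illustration and not a standard named in this article. The sample values are made up.

```python
import bisect


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b.

    0.0 means the two samples have identical distributions;
    1.0 means they do not overlap at all.
    """
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # Both CDFs are step functions, so checking every observed value suffices.
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up ages (hypothetical): one real column and two synthetic candidates.
real_ages = [34, 45, 52, 61, 70, 39, 58, 66]
close_ages = [36, 44, 50, 60, 69, 41, 57, 65]  # resembles the real column
far_ages = [20, 22, 25, 27, 30, 31, 33, 35]    # much younger population

print("close candidate:", ks_statistic(real_ages, close_ages))
print("far candidate:  ", ks_statistic(real_ages, far_ages))
```

A small value suggests that one column is distributed like its real counterpart. It says nothing about relationships between columns, clinical plausibility or privacy, so a check like this can only be one part of a broader evaluation.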

SYNTHIA: A European step forward

Launched in late 2024, the SYNTHIA project is the most ambitious European initiative focused on synthetic data for healthcare. The project runs until 2029 and is funded by the EU’s Innovative Health Initiative with a budget of 22.4 million euros. Its aim is to build a federated platform that enables researchers to generate, evaluate and use synthetic patient data in a secure and ethical way.

SYNTHIA focuses on six diseases: lung and breast cancer, multiple myeloma, diffuse large B-cell lymphoma, Alzheimer’s disease and type 2 diabetes.

It uses advanced AI methods such as GANs, federated learning and hybrid modelling. The project produces synthetic datasets that are multimodal and longitudinal. These include lab results, clinical notes, imaging data, genomics and mobile health data. Each dataset is reviewed for privacy, accuracy and clinical usefulness. Labels indicate what the data can be used for, how reliable it is and what privacy safeguards are in place.

The platform allows researchers across Europe to work with synthetic data that meets quality standards and does not violate GDPR rules.

As the SYNTHIA coordinators explain: “Creating efficient synthetic datasets using AI is the only way to protect data privacy while enabling progress in precision medicine.”

Europe now has a chance to lead in data-driven medical research. But success depends on regulation. Synthetic data can only deliver on its promise if new laws support science without creating new barriers.