Robust specific utility for the primary endpoints
We generated 2160 synthetic datasets with varying parameter configurations, half with CLARITY and half with ADVANCE as the reference dataset (Figure 1). Analogous to hyperparameter tuning in predictive model development, we tested different configurations to identify the parameter values yielding the best compromise between privacy and utility. We used CLARITY to assess whether synthetic datasets could capture the information on both efficacy and safety endpoints in a standard parallel two-arm design. We used ADVANCE to test the robustness of the technique for more complex study designs: three arms were available, and patients in the placebo arm were re-randomized after 1 year to one of the two peginterferon beta (Peg-IFNb) regimens for the second year; however, only efficacy data were available. Despite the complexity of the ADVANCE study design, only a few individual observations in some datasets had to be post-processed to keep the study design consistent. The missing data patterns due to attrition were well replicated, although the number of patients per arm was not necessarily as balanced as after a true randomization (Supplementary Figures 5 and 6). The primary endpoint estimates were robustly replicated across the different configurations (Figure 2). For CLARITY, the estimate was within the reported 95% confidence interval (CI) in 783/1080 datasets (72.5%), always with a significant p-value. For ADVANCE, the estimate was within the reported 95% CI in 876/1080 datasets (81.1%), of which 873 (80.8%) had a significant p-value.
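As an illustration of the replication check underlying these counts, the sketch below tallies, over a collection of synthetic datasets, how often a re-estimated primary endpoint falls within the trial's reported 95% CI with a significant p-value. The helper `estimate_primary_endpoint` is hypothetical and stands for refitting the original RCT's primary analysis on one synthetic dataset.

```python
# Minimal sketch of the CI-coverage tally, assuming a hypothetical helper
# `estimate_primary_endpoint(df)` that refits the RCT's primary analysis
# on one synthetic dataset and returns (estimate, p_value).
def count_replications(synthetic_datasets, ci_low, ci_high, alpha=0.05):
    inside_ci, significant = 0, 0
    for df in synthetic_datasets:
        estimate, p_value = estimate_primary_endpoint(df)  # hypothetical refit
        if ci_low <= estimate <= ci_high:
            inside_ci += 1           # estimate within the reported 95% CI
            if p_value < alpha:
                significant += 1     # ...and the effect remains significant
    return inside_ci, significant
```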
Robust privacy
Privacy was assessed with the privacy metrics returned by the avatar server. They are defined briefly in Table 1 and in detail on Octopize's website 22. Among the 2160 generated datasets, only 4 contained an avatar that was, by chance, identical to a reference data point (i.e., a row direct match). The distance from the avatars to the closest reference data point assesses the dispersion of the synthetic data points relative to the set of reference data points: the higher, the better the privacy. It was above 0.2, the threshold recommended by the Software Editor, for all of the generated datasets. We focused the rest of the report on the metric that was the most difficult to optimize, namely the hidden rate (HR), which is specific to the avatars technique and measures the risk of membership inference attacks 19. This attack scenario is extreme because the attacker is assumed to know all of the victim's variables. In our case, an attacker with access to the synthetic dataset would attempt to determine whether the victim was enrolled in the RCT and thus infer his/her diagnosis of MS. The HR is the proportion of patients for whom a distance-based linkage between the reference data point and the corresponding avatar would be erroneous. All 2160 generations had an HR above 80% (Figure 2). The HR increased in the post-processed datasets whose privacy was assessed with the default encoding of all variables and unweighted FAMD projections (not shown). Overall, this shows the robustness of the avatars technique regarding privacy.
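For concreteness, the following sketch reproduces the HR definition given above as a nearest-neighbor linkage attack. It assumes (our assumption, not Octopize's implementation) two aligned numeric matrices in the same projected space, where row i of `avatars` was generated from row i of `reference`.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hidden_rate(reference, avatars):
    """Share of patients whose nearest avatar is NOT their own avatar,
    i.e., for whom a distance-based linkage would be erroneous.
    Higher values mean better protection against membership inference."""
    dists = cdist(reference, avatars, metric="euclidean")
    nearest = dists.argmin(axis=1)                   # closest avatar per patient
    erroneous = nearest != np.arange(len(reference))
    return float(erroneous.mean())
```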
Table 1. Privacy metrics of the selected datasets generated with the optimized parameters. The metrics are grouped according to the conceptual anonymization criteria postulated by the European Data Protection Board. Their detailed definitions are available on the Software Editor's website 22. All distances are Euclidean.

| Anonymization criterion | Metric | Definition | Software Editor recommendation (Indicative) | CLARITY (Optimized parameters) | ADVANCE (Optimized parameters) |
|---|---|---|---|---|---|
| Singling out | Distance to the closest | Median distance between each synthetic data point and its closest reference data point | >0.2 | 0.31 | 0.30 |
| Singling out | Distance to the closest ratio | Median ratio of the distances between each synthetic data point and its closest and second-closest reference data points | >0.3 | 0.81 | 0.60 |
| Linkability | Column direct match protection | Minimum probability that a variable could not be used as a direct identifier | >50% | 84.8% | 90.9% |
| Linkability | Row direct match protection | Percentage of synthetic data points that are not identical to any reference data point | >90% | 100% | 100% |
| Inference | Median local cloaking | Median number of avatars more similar to the reference data point of a patient than his/her own avatar | >5 | 3 | 6 |
| Inference | Hidden rate | Probability of erroneous distance-based matching | >90% | 85.0% | 93.2% |
| Inference | Categorical hidden rate | Probability of erroneous distance-based matching based on categorical variables only | >90% | 98.4% | 98.0% |
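The local cloaking metric in Table 1 can be sketched in the same setting as the hidden-rate example above (aligned reference and avatar matrices, our assumption): for each patient, count the avatars that lie closer to their reference record than the avatar generated from it, then take the median over patients.

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_local_cloaking(reference, avatars):
    """Median, over patients, of the number of avatars more similar to a
    patient's reference data point than his/her own avatar."""
    dists = cdist(reference, avatars, metric="euclidean")
    idx = np.arange(len(reference))
    own = dists[idx, idx]                          # distance to own avatar
    closer = (dists < own[:, None]).sum(axis=1)    # strict < excludes own avatar
    return float(np.median(closer))
```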
Synthetic dataset selection in the context of a privacy-utility trade-off
The assessment of privacy and utility revealed a privacy-utility trade-off (Figure 3). We assessed the general utility with the mean of the Hellinger distances between the univariate distributions. Small k values increased utility while decreasing privacy. Small numbers of principal components (ncp) increased utility with little effect on privacy. Weighting and encoding some variables differently could optimize the trade-off, as reflected by the generation of datasets closer to the sweet spot of both high general utility and high privacy. For both RCTs, we selected the dataset with the most satisfying trade-off between privacy and specific utility. Better general utility did not automatically translate into better specific utility. For CLARITY, four datasets (0.4%) replicated all primary and secondary efficacy endpoints. For ADVANCE, no dataset replicated all primary and secondary efficacy endpoints for the two tested regimens, but 14 (1.3%) did when neglecting the non-commercial regimen. For CLARITY, we selected the dataset with the best replication of absolute estimates, generated with k = 5, ncp = 5, the weighting of the study arm by 20, and the encoding of relapse counts as categories (0, 1, 2, and 3 or more) and adverse event (AE) counts as Booleans (none vs. any). This encoding was reverted at post-processing before replicating the RCT analyses, but yielded some loss of granularity. For ADVANCE, we selected the dataset generated with k = 2, ncp = 10, the weighting of the study arm by 20, and the weighting of relapse counts and CDW delays by 2. Missing quantitative values were encoded as aberrant negative values instead of being left to the imputation of the avatar server. The selected dataset from CLARITY had a median local cloaking (LC) of 3 and an HR of 85.0%; the one from ADVANCE had a median LC of 6 and an HR of 93.2% (Table 1). The rest of the report focuses on these two selected datasets (referred to as "optimized") and on two datasets generated with default parameters (k = 10, ncp = 10, no weighting) and the third random state.
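As an illustration of the CLARITY encoding described above, the sketch below coarsens counts before anonymization and maps them back afterwards; the column names (`relapse_count`, `ae_count`) are hypothetical. The decode step makes the granularity loss explicit: "3 or more" relapses and "any" AE can only be restored as lower bounds.

```python
import pandas as pd

def encode(df):
    """Coarsen counts before avatar generation (hypothetical column names)."""
    out = df.copy()
    out["relapse_count"] = pd.cut(
        df["relapse_count"],
        bins=[-0.5, 0.5, 1.5, 2.5, float("inf")],
        labels=["0", "1", "2", "3+"],     # 0, 1, 2, and 3 or more
    )
    out["ae_count"] = df["ae_count"] > 0  # none vs. any
    return out

def decode(df):
    """Revert the encoding at post-processing; granularity is lost."""
    out = df.copy()
    out["relapse_count"] = (
        out["relapse_count"].map({"0": 0, "1": 1, "2": 2, "3+": 3}).astype(int)
    )
    out["ae_count"] = out["ae_count"].astype(int)  # "any" becomes a count of 1
    return out
```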
Good general utility at the population level despite alterations in variable distributions
Table 2. General utility metrics of the selected datasets generated with the optimized parameters.

| General utility metric | Definition | Recommended target by Octopize (Indicative) | CLARITY (Optimized parameters) | ADVANCE (Optimized parameters) |
|---|---|---|---|---|
| Mean of Hellinger distances | Mean of the Hellinger distances of each variable | <0.10 | 0.10 | 0.09 |
| Correlation difference ratio | Average of the absolute variations of Pearson's correlation | <10% | 2.52% | 1.49% |
To assess the general utility of both selected datasets, we evaluated the overlap of the variable distributions. The mean of the Hellinger distances was 0.10 and 0.09 for the selected datasets from CLARITY and ADVANCE, respectively (Table 2). Bivariate distributions and weighted FAMD projections were similar (Supplementary Figures 2 and 3), as were the missing data patterns (Supplementary Figure 4). The effects of the avatar method on the variable distributions were consistent across all generated datasets and only modulated by the parameter configuration (Figure 4). The distributions of the categorical variables were the best preserved, with a tendency to amplify class imbalance. The distributions of the quantitative variables tended to be narrowed and normalized, but their means were similar when skewness was limited. Of note, many distributions, especially MRI lesion counts, were skewed, with 0 as the most frequent value and many outliers in the right tail. As a result of the privacy-by-design approach of the technique, the avatars of the outliers were drastically recentered toward the high-density regions of the synthetic dataset, which tended to decrease the average absolute counts. The most affected variable was the count of gadolinium-enhancing (GdE) lesions at 2 years in ADVANCE: its average was reduced by about a factor of 3 in the default dataset (from 0.47 to 0.14), which could be mitigated by the optimized configuration.
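The general utility metric reported in Table 2 can be sketched as follows, under our own assumptions about the computation (quantitative variables binned on the pooled range, categorical variables compared on class frequencies); the exact implementation used by the avatar server may differ.

```python
import numpy as np
import pandas as pd

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mean_hellinger(reference, synthetic, n_bins=20):
    """Mean per-variable Hellinger distance between two DataFrames with
    identical columns (the binning scheme and bin count are assumptions)."""
    dists = []
    for col in reference.columns:
        if pd.api.types.is_numeric_dtype(reference[col]):
            pooled = pd.concat([reference[col], synthetic[col]]).dropna()
            bins = np.histogram_bin_edges(pooled, bins=n_bins)
            p = np.histogram(reference[col].dropna(), bins=bins)[0].astype(float)
            q = np.histogram(synthetic[col].dropna(), bins=bins)[0].astype(float)
        else:
            levels = pd.concat([reference[col], synthetic[col]]).dropna().unique()
            p = reference[col].value_counts().reindex(levels, fill_value=0).to_numpy(float)
            q = synthetic[col].value_counts().reindex(levels, fill_value=0).to_numpy(float)
        dists.append(hellinger(p / p.sum(), q / q.sum()))
    return float(np.mean(dists))
```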
The specific utility for multiple endpoints needs optimization
While most generations replicated the primary endpoint of the respective RCT, replicating all of the secondary endpoints as well was more challenging (Figures 5 and 6). Generations with default parameters replicated most relative endpoints but tended to shift absolute endpoints as a result of the amplification of class imbalance by the avatars technique, which increases the percentages of the most represented classes and decreases those of the minority classes. The annualized relapse rate (ARR) and lesion rates were highly sensitive to the shift in the averages of count variables (see the sketch below). This limitation could be mitigated by optimizing the parameters, especially the weighting and encoding of some variables. The replications of the flow charts and tables of both RCT reports are presented in the supplementary information (Supplementary Figures 5 and 6, Supplementary Tables 3–5 and 6–7).
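The sensitivity of the ARR to mean shifts follows directly from its standard definition as total relapses over total patient-years: recentering a single right-tail outlier lowers the rate. A minimal sketch, assuming hypothetical per-patient relapse counts and follow-up durations:

```python
import numpy as np

def annualized_relapse_rate(relapses, follow_up_years):
    """ARR: total number of relapses divided by total patient-years."""
    return np.sum(relapses) / np.sum(follow_up_years)

# Recentering one outlier (4 -> 2 relapses) shifts the ARR markedly:
print(annualized_relapse_rate([0, 0, 1, 4], [2, 2, 2, 2]))  # 0.625
print(annualized_relapse_rate([0, 0, 1, 2], [2, 2, 2, 2]))  # 0.375
```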
For CLARITY (Figure 5), we pushed the assessment of specific utility up to the replication of interaction tests in a post hoc subgroup analysis of patients with high relapse activity (i.e., 2 or more relapses in the year before the study baseline) 23. The alteration of the univariate distributions by the avatars method suggested that subgroup analyses would be harder to replicate, but the selected dataset managed to do so. These post hoc subgroup analyses were critical for the market approval of cladribine in this subpopulation, as the initial submission for the whole relapsing-remitting MS population had been withdrawn due to safety concerns about the risk of neoplasm (6 vs. 0 patients in the real dataset). The safety endpoints were very sensitive to the skewness of the count distributions, such that the proportions of patients with serious AEs were drastically reduced in the default dataset. Encoding AEs as Booleans mitigated this issue and also replicated the contrast in neoplasm incidence (5 avatars with cladribine vs. 0 with placebo).
For ADVANCE (Figure 6), the complex design aimed to compare MS activity during the second year of treatment against the first to assess the run-in (i.e., delay of action) of Peg-IFNb. Both the selected dataset and the one generated with default parameters replicated the decrease of the ARR during year 2 with the "1 dose per 2 weeks" regimen, while only the optimized one replicated the stability of the ARR with the "1 dose per 4 weeks" regimen. In the selected dataset, the only endpoints that could not be replicated were the 12-week CDW hazard ratio estimate between the two tested regimens and the 24-week CDW hazard ratio estimate for the non-commercial regimen. The first was outside the reported 95% CI with a p-value that became significant, while the second was in the wrong effect direction. The replicability of the absolute GdE lesion count was poor whatever the configuration, a limitation likely associated with the skewness of this variable's distribution, essentially made of outliers (Figure 4). Overall, these limitations force the specific utility assessment of synthetic datasets to prioritize the endpoints, as their replicability is uneven and may be conditioned by the characteristics of the reference dataset. The synthetic data generation may then be optimized toward a given purpose by weighting some variables or encoding them differently.