Robust specific utility for the primary endpoints
We generated 2160 synthetic datasets with varying parameter configurations, half with CLARITY and half with ADVANCE as the reference dataset (Figure 1). Analogous to hyperparameter tuning in predictive model development, we tested different configurations to identify the parameter values yielding the best compromise between privacy and utility. We used CLARITY to assess whether synthetic datasets could capture the information on both efficacy and safety endpoints in a standard parallel two-arm design. We used ADVANCE to test the robustness of the technique for more complex study designs: three arms were available, and patients in the placebo arm were re-randomized after 1 year to one of the two peginterferon beta (Peg-IFNb) regimens for the second year; however, only efficacy data were available. Despite the complexity of the ADVANCE study design, only a few individual observations in some datasets had to be post-processed to keep the study design consistent. The missing data patterns due to attrition were well replicated, although the number of patients per arm was not necessarily as balanced as after a true randomization (Supplementary Figures 5 and 6). The primary endpoint estimates were robustly replicated across the different configurations (Figure 2). For CLARITY, the estimate was within the reported 95% confidence interval (CI) in 783/1080 datasets (72.5%), always with a significant p-value. For ADVANCE, the estimate was within the reported 95% CI in 876/1080 datasets (81.1%), of which 873 (80.8%) had a significant p-value.
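As an illustration of the replication check underlying these counts, the sketch below tallies, over a collection of synthetic datasets, how often a re-estimated primary endpoint falls within the trial's reported 95% CI with a significant p-value. The helper `estimate_primary_endpoint` is hypothetical and stands for refitting the original RCT's primary analysis on one synthetic dataset.

```python
# Minimal sketch of the CI-coverage tally, assuming a hypothetical helper
# `estimate_primary_endpoint(df)` that refits the RCT's primary analysis
# on one synthetic dataset and returns (estimate, p_value).
def count_replications(synthetic_datasets, ci_low, ci_high, alpha=0.05):
    inside_ci, significant = 0, 0
    for df in synthetic_datasets:
        estimate, p_value = estimate_primary_endpoint(df)  # hypothetical refit
        if ci_low <= estimate <= ci_high:
            inside_ci += 1           # estimate within the reported 95% CI
            if p_value < alpha:
                significant += 1     # ...and the effect remains significant
    return inside_ci, significant
```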
Robust privacy
Privacy was assessed with the privacy metrics returned by the avatar server. They are defined briefly in Table 1 and in detail on Octopize's website 22. Among the 2160 generated datasets, only 4 contained an avatar that was, by chance, identical to a reference data point (i.e., a row direct match). The distance from the avatars to the closest reference data point assesses the dispersion of the synthetic data points relative to the set of reference data points: the higher, the better the privacy. It was above 0.2, the threshold recommended by the Software Editor, for all of the generated datasets. We focused the rest of the report on the metric that was the most difficult to optimize, namely the hidden rate (HR), which is specific to the avatars technique and measures the risk of membership inference attacks 19. This attack scenario is extreme because the attacker is assumed to know all of the victim's variables. In our case, an attacker with access to the synthetic dataset would attempt to determine whether the victim was enrolled in the RCT and thus infer his/her diagnosis of MS. The HR is the proportion of patients for whom a distance-based linkage between the reference data point and the corresponding avatar would be erroneous. All 2160 generations had an HR above 80% (Figure 2). The HR increased in the post-processed datasets whose privacy was assessed with the default encoding of all variables and unweighted FAMD projections (not shown). Overall, this shows the robustness of the avatars technique regarding privacy.
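For concreteness, the following sketch reproduces the HR definition given above as a nearest-neighbor linkage attack. It assumes (our assumption, not Octopize's implementation) two aligned numeric matrices in the same projected space, where row i of `avatars` was generated from row i of `reference`.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hidden_rate(reference, avatars):
    """Share of patients whose nearest avatar is NOT their own avatar,
    i.e., for whom a distance-based linkage would be erroneous.
    Higher values mean better protection against membership inference."""
    dists = cdist(reference, avatars, metric="euclidean")
    nearest = dists.argmin(axis=1)                   # closest avatar per patient
    erroneous = nearest != np.arange(len(reference))
    return float(erroneous.mean())
```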
Table 1. Privacy metrics of the selected datasets generated with the optimized parameters. The metrics are grouped according to the conceptual anonymization criteria postulated by the European Data Protection Board. Their detailed definitions are available on the Software Editor's website 22. All distances are Euclidean.

| Anonymization criterion | Metric | Definition | Software Editor recommendation (Indicative) | CLARITY (Optimized parameters) | ADVANCE (Optimized parameters) |
|---|---|---|---|---|---|
| Singling out | Distance to the closest | Median distance between each synthetic data point and its closest reference data point | >0.2 | 0.31 | 0.30 |
| Singling out | Distance to the closest ratio | Median ratio of the distances between each synthetic data point and its closest and second-closest reference data points | >0.3 | 0.81 | 0.60 |
| Linkability | Column direct match protection | Minimum probability that a variable could not be used as a direct identifier | >50% | 84.8% | 90.9% |
| Linkability | Row direct match protection | Percentage of synthetic data points that are not identical to any reference data point | >90% | 100% | 100% |
| Inference | Median local cloaking | Median number of avatars more similar to the reference data point of a patient than his/her own avatar | >5 | 3 | 6 |
| Inference | Hidden rate | Probability of erroneous distance-based matching | >90% | 85.0% | 93.2% |
| Inference | Categorical hidden rate | Probability of erroneous distance-based matching based on categorical variables only | >90% | 98.4% | 98.0% |
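The local cloaking metric in Table 1 can be sketched in the same setting as the hidden-rate example above (aligned reference and avatar matrices, our assumption): for each patient, count the avatars that lie closer to their reference record than the avatar generated from it, then take the median over patients.

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_local_cloaking(reference, avatars):
    """Median, over patients, of the number of avatars more similar to a
    patient's reference data point than his/her own avatar."""
    dists = cdist(reference, avatars, metric="euclidean")
    idx = np.arange(len(reference))
    own = dists[idx, idx]                          # distance to own avatar
    closer = (dists < own[:, None]).sum(axis=1)    # strict < excludes own avatar
    return float(np.median(closer))
```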
Synthetic dataset selection in the context of a privacy-utility trade-off
The assessment of privacy and utility revealed a privacy-utility trade-off (Figure 3). We assessed the general utility with the mean of the Hellinger distances between the univariate distributions. Small k values increased utility while decreasing privacy. Small numbers of principal components (ncp) increased utility with little effect on privacy. Weighting and encoding some variables differently could optimize the trade-off, as reflected by the generation of datasets closer to the sweet spot of both high general utility and high privacy. For both RCTs, we selected the dataset with the most satisfying trade-off between privacy and specific utility. Better general utility did not automatically translate into better specific utility. For CLARITY, four datasets (0.4%) replicated all primary and secondary efficacy endpoints. For ADVANCE, no dataset replicated all primary and secondary efficacy endpoints for the two tested regimens, but 14 (1.3%) did when neglecting the non-commercial regimen. For CLARITY, we selected the dataset with the best replication of absolute estimates, generated with k = 5, ncp = 5, the weighting of the study arm by 20, and the encoding of relapse counts as categories (0, 1, 2, and 3 or more) and adverse event (AE) counts as Booleans (none vs. any). This encoding was reverted at post-processing before replicating the RCT analyses, but yielded some loss of granularity. For ADVANCE, we selected the dataset generated with k = 2, ncp = 10, the weighting of the study arm by 20, and the weighting of relapse counts and CDW delays by 2. Missing quantitative values were encoded as aberrant negative values instead of being left to the imputation of the avatar server. The selected dataset from CLARITY had a median local cloaking (LC) of 3 and an HR of 85.0%; the one from ADVANCE had a median LC of 6 and an HR of 93.2% (Table 1). The rest of the report focuses on these two selected datasets (referred to as "optimized") and on two datasets generated with default parameters (k = 10, ncp = 10, no weighting) and the third random state.
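As an illustration of the CLARITY encoding described above, the sketch below coarsens counts before anonymization and maps them back afterwards; the column names (`relapse_count`, `ae_count`) are hypothetical. The decode step makes the granularity loss explicit: "3 or more" relapses and "any" AE can only be restored as lower bounds.

```python
import pandas as pd

def encode(df):
    """Coarsen counts before avatar generation (hypothetical column names)."""
    out = df.copy()
    out["relapse_count"] = pd.cut(
        df["relapse_count"],
        bins=[-0.5, 0.5, 1.5, 2.5, float("inf")],
        labels=["0", "1", "2", "3+"],     # 0, 1, 2, and 3 or more
    )
    out["ae_count"] = df["ae_count"] > 0  # none vs. any
    return out

def decode(df):
    """Revert the encoding at post-processing; granularity is lost."""
    out = df.copy()
    out["relapse_count"] = (
        out["relapse_count"].map({"0": 0, "1": 1, "2": 2, "3+": 3}).astype(int)
    )
    out["ae_count"] = out["ae_count"].astype(int)  # "any" becomes a count of 1
    return out
```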
Good general utility at the population level despite alterations in variable distributions
Table 2. General utility metrics of the selected datasets generated with the optimized parameters.

| General utility metric | Definition | Recommended target by Octopize (Indicative) | CLARITY (Optimized parameters) | ADVANCE (Optimized parameters) |
|---|---|---|---|---|
| Mean of Hellinger distances | Mean of the Hellinger distances of each variable | <0.10 | 0.10 | 0.09 |
| Correlation difference ratio | Average of the absolute variations of Pearson's correlation | <10% | 2.52% | 1.49% |
To assess the general utility of both selected datasets, we evaluated the overlap of the variable distributions. The mean of the Hellinger distances was 0.10 and 0.09 for the selected datasets from CLARITY and ADVANCE, respectively (Table 2). Bivariate distributions and weighted FAMD projections were similar (Supplementary Figures 2 and 3), as were the missing data patterns (Supplementary Figure 4). The effects of the avatar method on the variable distributions were consistent across all generated datasets and only modulated by the parameter configuration (Figure 4). The distributions of the categorical variables were the best preserved, with a tendency to amplify class imbalance. The distributions of the quantitative variables tended to be narrowed and normalized, but their means were similar when skewness was limited. Of note, many distributions, especially MRI lesion counts, were skewed, with 0 as the most frequent value and many outliers in the right tail. As a result of the privacy-by-design approach of the technique, the avatars of the outliers were drastically recentered toward the high-density regions of the synthetic dataset, which tended to decrease the average absolute counts. The most affected variable was the count of gadolinium-enhancing (GdE) lesions at 2 years in ADVANCE: its average was reduced by about a factor of 3 in the default dataset (from 0.47 to 0.14), which could be mitigated by the optimized configuration.
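The general utility metric reported in Table 2 can be sketched as follows, under our own assumptions about the computation (quantitative variables binned on the pooled range, categorical variables compared on class frequencies); the exact implementation used by the avatar server may differ.

```python
import numpy as np
import pandas as pd

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mean_hellinger(reference, synthetic, n_bins=20):
    """Mean per-variable Hellinger distance between two DataFrames with
    identical columns (the binning scheme and bin count are assumptions)."""
    dists = []
    for col in reference.columns:
        if pd.api.types.is_numeric_dtype(reference[col]):
            pooled = pd.concat([reference[col], synthetic[col]]).dropna()
            bins = np.histogram_bin_edges(pooled, bins=n_bins)
            p = np.histogram(reference[col].dropna(), bins=bins)[0].astype(float)
            q = np.histogram(synthetic[col].dropna(), bins=bins)[0].astype(float)
        else:
            levels = pd.concat([reference[col], synthetic[col]]).dropna().unique()
            p = reference[col].value_counts().reindex(levels, fill_value=0).to_numpy(float)
            q = synthetic[col].value_counts().reindex(levels, fill_value=0).to_numpy(float)
        dists.append(hellinger(p / p.sum(), q / q.sum()))
    return float(np.mean(dists))
```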
The specific utility for multiple endpoints needs optimization
While most generations replicated the primary endpoint of the respective RCT, replicating all of the secondary endpoints as well was more challenging (Figures 5 and 6). Generations with default parameters replicated most relative endpoints but tended to shift absolute endpoints as a result of the amplification of class imbalance by the avatars technique, which increases the percentages of the most represented classes and decreases those of the minority classes. The annualized relapse rate (ARR) and lesion rates were highly sensitive to the shift in the averages of count variables (see the sketch below). This limitation could be mitigated by optimizing the parameters, especially the weighting and encoding of some variables. The replications of the flow charts and tables of both RCT reports are presented in the supplementary information (Supplementary Figures 5 and 6, Supplementary Tables 3–5 and 6–7).
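The sensitivity of the ARR to mean shifts follows directly from its standard definition as total relapses over total patient-years: recentering a single right-tail outlier lowers the rate. A minimal sketch, assuming hypothetical per-patient relapse counts and follow-up durations:

```python
import numpy as np

def annualized_relapse_rate(relapses, follow_up_years):
    """ARR: total number of relapses divided by total patient-years."""
    return np.sum(relapses) / np.sum(follow_up_years)

# Recentering one outlier (4 -> 2 relapses) shifts the ARR markedly:
print(annualized_relapse_rate([0, 0, 1, 4], [2, 2, 2, 2]))  # 0.625
print(annualized_relapse_rate([0, 0, 1, 2], [2, 2, 2, 2]))  # 0.375
```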
For CLARITY (Figure 5), we pushed the assessment of specific utility up to the replication of interaction tests in a post hoc subgroup analysis of patients with high relapse activity (i.e., 2 or more relapses in the year before the study baseline) 23. The alteration of the univariate distributions by the avatars method suggested that subgroup analyses would be harder to replicate, but the selected dataset managed to do so. These post hoc subgroup analyses were critical for the market approval of cladribine in this subpopulation, as the initial submission for the whole relapsing-remitting MS population had been withdrawn due to safety concerns about the risk of neoplasm (6 vs. 0 patients in the real dataset). The safety endpoints were very sensitive to the skewness of the count distributions, such that the proportions of patients with serious AEs were drastically reduced in the default dataset. Encoding AEs as Booleans mitigated this issue and also replicated the contrast in neoplasm incidence (5 avatars with cladribine vs. 0 with placebo).
For ADVANCE (Figure 6), the complex design aimed to compare MS activity during the second year of treatment against the first to assess the run-in (i.e., delay of action) of Peg-IFNb. Both the selected dataset and the one generated with default parameters replicated the decrease of the ARR during year 2 with the "1 dose per 2 weeks" regimen, while only the optimized one replicated the stability of the ARR with the "1 dose per 4 weeks" regimen. In the selected dataset, the only endpoints that could not be replicated were the 12-week CDW hazard ratio estimate between the two tested regimens and the 24-week CDW hazard ratio estimate for the non-commercial regimen. The first was outside the reported 95% CI with a p-value that became significant, while the second was in the wrong effect direction. The replicability of the absolute GdE lesion count was poor whatever the configuration, a limitation likely associated with the skewness of this variable's distribution, essentially made of outliers (Figure 4). Overall, these limitations force the specific utility assessment of synthetic datasets to prioritize the endpoints, as their replicability is uneven and may be conditioned by the characteristics of the reference dataset. The synthetic data generation may then be optimized toward a given purpose by weighting some variables or encoding them differently.