Study cohort
Recruitment and description
We included de-identified data from 30,901 female participants of the EstBB cohort. In brief, the EstBB is managed by the Estonian Genome Center at the University of Tartu (EGCUT) and was established to collect genetic and health information from a large sample of the Estonian population in order to advance public health (29–31).
Eligible participants were volunteers aged 18 years or older with Estonian nationality. Approximately 10,000 participants were recruited from 2002 to 2004. Recruitment was then paused until 2006 due to financial circumstances and resumed from 2006 to 2012 across all of Estonia (15 counties), by which time the EstBB cohort included almost 52,000 participants.
Participants were initially recruited through general practitioners. Recruitment was later extended to private practices, hospitals, and special recruitment offices of the EGCUT. Recruitment was entirely volunteer-based: no direct contact with the Estonian population was allowed (e.g. through invitation letters), so participants had to actively sign up after hearing about the cohort study in private, at their health care institution, or through promotion at special public events and in the media (29, 30).
By 2012, the number of participants in the EstBB cohort corresponded to approx. 5% of the adult Estonian population; 66% were women. The majority are of Estonian ethnicity (81.2%), but the cohort also includes Russian (15.4%), Ukrainian (1.3%), and Belarusian (0.6%) ethnicities. Compared with the Estonian census report from 2000 (32), the EstBB cohort represents the general population well overall, although women (by approx. 10%), younger and middle-aged generations, and people with higher and professional secondary educational levels are overrepresented (29, 30).
Eligibility
This analysis included data from women in the EstBB cohort who were aged 20 to 89 years and had no current or previous diagnosis of breast cancer at the time of cohort entry. We excluded women who had entered the cohort after 2011, as their 3- and 5-year follow-up information was not yet available at the time of analysis (Fig. 1).
Data collection
Information on all variables used in this analysis (smoking status, educational level, prevalent comorbidities, age, BMI, and PRS) was initially collected on the day of recruitment through standardized interviews and questionnaires, blood samples, and existing medical records. Diseases were classified according to the International Classification of Diseases, 10th Revision (ICD-10) (29, 30).
The PRS (metaGRS2) used in this analysis was developed by Läll and colleagues. In brief, they developed several different PRS (GRS) based on the principle that the individual effects of identified breast cancer SNPs, each weighted by its corresponding logistic beta-coefficient (most often from GWAS), can be linearly combined into a single summary PRS value. A person's individual PRS value is then the weighted sum of the SNPs that this person carries (17, 27).
Läll et al. derived 7 different PRS for the EstBB cohort in three main steps. They first derived two PRS (GRS70 and GRS75) based on two different sets of SNPs from other recent PRS publications. They then derived several other PRS based on summary statistics from two recent GWAS (the Breast Cancer Association Consortium and the UK Biobank), from which they selected the two PRS with the smallest p-values for the association with breast cancer (GRSONCO and GRSUK), using logistic regression analysis on the EstBB cohort (319 women with prevalent breast cancer and 2000 women without prevalent breast cancer). Thereafter, they derived 3 meta-PRS based on (i) the weighted average of the 4 individual PRS (metaGRS4), (ii) the weighted average of the three most strongly associated PRS (metaGRS3), and (iii) the weighted average of the two most strongly associated PRS (metaGRS2). Of all seven PRS, metaGRS2 showed the strongest association with breast cancer and was selected for this analysis. More details about the PRS can be found in the full report and supplementary files of Läll et al. (27).
Follow-up information on incident breast cancers and deaths was collected through biennial linkage to the Estonian Health Insurance Fund, the Estonian Causes of Death Registry, and the Estonian Cancer Registry. Every recorded diagnosis of breast cancer was confirmed by an oncologist. The last linkages used in this analysis were performed in December 2015 for breast cancer and in June 2017 for death (27). In this analysis, we studied incident diagnosed breast cancer and death after entry into the EstBB cohort. More information about the EstBB cohort and the PRS can be found elsewhere (27, 29–31, 33).
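As an illustration of the weighted-sum principle described above, the following minimal Python sketch computes a PRS value for one person. It is not the code used by Läll and colleagues; the function name, the coding of each SNP as a risk-allele dosage (0, 1 or 2), and the example weights are assumptions made for illustration only.

```python
def polygenic_risk_score(dosages, betas):
    """Weighted sum of SNP dosages.

    dosages: number of risk alleles carried per SNP (assumed coding: 0, 1 or 2)
    betas:   per-SNP weights, e.g. logistic beta-coefficients from a GWAS
    """
    if len(dosages) != len(betas):
        raise ValueError("dosages and betas must have the same length")
    return sum(beta * dosage for beta, dosage in zip(betas, dosages))


# Hypothetical weights and genotype for three SNPs (not real data)
example_betas = [0.10, -0.05, 0.20]
example_dosages = [2, 1, 0]
example_score = polygenic_risk_score(example_dosages, example_betas)
```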
Statistical analyses
Descriptive statistics for our study group are presented in Table 1. For continuous covariates (age, BMI, and follow-up time) we report the mean, median, standard deviation (SD), and interquartile range (IQR); for categorical covariates (education, smoking, prevalent comorbidities) we report the number (n) and proportion (%).
PRS results were categorized into 6 subgroups (0–25%, 25–50%, 50–75%, 75–85%, 85–95% and the top 5% PRS percentiles) in accordance with the analysis performed by Läll et al. (27). We subsequently fitted a full multivariable Cox proportional hazards regression model with main effects of age, BMI, year of entry, PRS as a 6-level categorical predictor, smoking status (never/former/current), education (less than secondary/secondary/university degree), and prevalent comorbidities (any prevalent cancer, Type 1 diabetes, Type 2 diabetes, myocardial infarction, and coronary artery disease), together with interaction terms for BMI and age, and for PRS and age. Follow-up time was measured in years from cohort entry to last linkage.
Only two covariates, age and PRS, were statistically significant predictors and were retained in subsequent models. We then fitted two Cox regression models: the first included only age, to reflect the current screening strategy, and the second added PRS to age as a predictor. From the latter PRS-age model, fitted to the full data set, we report breast cancer-specific hazard ratios for the main effects of age and PRS group (Table 2).
The proportional hazards assumption for both PRS and age was formally tested using Schoenfeld residuals (Supplementary Figure S1), and the fitted PRS-age model and age model were compared with a likelihood ratio test. We estimated cause-specific hazards of ‘death without breast cancer’ and ‘breast cancer’ from Cox regressions and, from these, derived for each woman the 3- and 5-year cumulative incidence of breast cancer under both the PRS-age model and the age model. Below, we refer to these model-estimated cumulative incidences as PRSage_mInc and age_mInc, respectively. They express the calculated probability of developing a clinically diagnosed breast cancer within the corresponding follow-up period. We analyzed and compared the distributions of age_mInc and PRSage_mInc in the study group.
To avoid overoptimistic incidence estimates from the fitted models, we evaluated each model's performance using 10-fold cross-validation. We randomly split the data into 10 equally sized parts; 9 parts were used to fit the model, which was then used to derive 3- and 5-year estimated incidences for the remaining part (the test set). This process was repeated 10 times, once for each distinct left-out test set. All 10 test sets, with their independently estimated incidences, were then combined into one merged data set. This merged data set was used to evaluate risk distribution, calibration, discrimination, and reclassification for both PRSage_mInc and age_mInc.
Unadjusted cumulative incidence curves were constructed separately for observed breast cancer and for death without breast cancer, stratified by PRS group, over time (in years) from study entry until the last follow-up for breast cancer (Fig. 2).
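The percentile grouping described above can be sketched as follows. This is an illustrative Python example, not the original analysis code: it assumes that percentiles are computed within the sample at hand, and the handling of ties and of scores that fall exactly on a cut-off is an assumption.

```python
from bisect import bisect_right

# Group labels and upper percentile bounds as listed in the text
PRS_LABELS = ["0-25%", "25-50%", "50-75%", "75-85%", "85-95%", "top 5%"]
PRS_UPPER_BOUNDS = [0.25, 0.50, 0.75, 0.85, 0.95]


def prs_groups(scores):
    """Assign each score to one of the six percentile groups."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from lowest to highest score
    groups = [None] * n
    for rank, i in enumerate(order):
        fraction = (rank + 0.5) / n  # mid-rank position, strictly between 0 and 1
        groups[i] = PRS_LABELS[bisect_right(PRS_UPPER_BOUNDS, fraction)]
    return groups


# Hypothetical scores: 100 distinct values, listed from highest to lowest
demo_scores = list(range(99, -1, -1))
demo_groups = prs_groups(demo_scores)
```

With 100 distinct scores, this yields groups of 25, 25, 25, 10, 10 and 5 women, matching the widths of the percentile bands.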
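For orientation, one standard way to obtain a cumulative incidence from two cause-specific hazards is shown below. The notation is ours and is an assumption, not taken from the original analysis: x denotes a woman's covariates (age, or age and PRS group), and the subscripts BC and D denote breast cancer and death without breast cancer.

```latex
\[
F_{\mathrm{BC}}(t \mid x) = \int_0^{t} S(u \mid x)\,\lambda_{\mathrm{BC}}(u \mid x)\,\mathrm{d}u,
\qquad
S(u \mid x) = \exp\!\left\{-\int_0^{u}\bigl[\lambda_{\mathrm{BC}}(s \mid x) + \lambda_{\mathrm{D}}(s \mid x)\bigr]\,\mathrm{d}s\right\}
\]
```

Here S(u | x) is the probability of being alive and free of breast cancer up to time u. Under this formulation, the model-estimated incidences would correspond to F_BC evaluated at t = 3 and t = 5 years.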
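The cross-validation scheme above can be sketched in Python as follows. The `fit` and `predict` arguments are hypothetical placeholders for fitting a model and deriving an estimated incidence; in this toy example they are replaced by a simple mean, not by a Cox model. When the sample size is not divisible by the number of folds, the fold sizes in this sketch differ by at most one.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly split the indices 0..n-1 into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_predictions(rows, fit, predict, k=10, seed=0):
    """Predict each row from a model fitted without that row's fold."""
    predictions = [None] * len(rows)
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [rows[i] for i in range(len(rows)) if i not in held_out]
        model = fit(train)  # fitted on the other k-1 parts only
        for i in test_idx:
            predictions[i] = predict(model, rows[i])
    return predictions  # merged predictions from all k test sets


# Toy usage: the "model" is just the mean of the training values
toy_rows = list(range(20))
toy_preds = cross_validated_predictions(
    toy_rows,
    fit=lambda train: sum(train) / len(train),
    predict=lambda model, row: model,
)
```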
Since the current Estonian screening program invites women aged 50 to 62 years for biennial screening (34), we constructed three age groups: <50 years, 50 to 62 years, and >62 years. For evaluating calibration, we combined the PRS groups and age groups, which resulted in 18 different PRS-age subgroups.
Our calibration plots show, for each PRS-age subgroup, the observed cumulative incidence against the mean PRSage_mInc for the same group. The observed cumulative incidence was calculated as the proportion (%) of breast cancer events within 3 or 5 years among the total number of women in each PRS-age subgroup, with corresponding 95% CI calculated using the Wilson method (Fig. 4, Table 3). By study design, no woman was censored within the 3- and 5-year periods considered.
Receiver operating characteristic (ROC) curves were constructed and the corresponding areas under the curve (AUC) were calculated for the age_mInc and PRSage_mInc, for both the 3- and the 5-year time points. The 95% CI of the AUC was calculated with a non-parametric method (DeLong) (Fig. 5).
Using a risk threshold of ≥ 1%, we classified women as high or low risk. We constructed reclassification tables and calculated the net reclassification index (NRI) to explore how the PRSage_mInc reclassified women into high- and low-risk groups, compared to classification by the age_mInc. Hence, within the group of women with breast cancer, the absolute numbers and proportions of correctly reclassified women (i.e. women moving from an age_mInc below 1% to a PRSage_mInc ≥ 1%) and incorrectly reclassified women (i.e. women moving from an age_mInc ≥ 1% to a PRSage_mInc below 1%) were calculated. This was likewise calculated for the group of women without breast cancer, but with the opposite direction for correct reclassifications (i.e. women moving from an age_mInc ≥ 1% to a PRSage_mInc below 1%) and incorrect reclassifications (i.e. women moving from an age_mInc below 1% to a PRSage_mInc ≥ 1%). As an overall summary measure, the NRI was then calculated by first subtracting the proportion of incorrect reclassifications from the proportion of correct ones in each group, and then adding the two resulting differences. The 95% CIs of the NRI were calculated as proposed by Pencina et al., 2008 (35) (Tables 5 and 6).
P-values below 0.05 were considered to indicate statistically significant differences. All statistical analyses were performed using RStudio (Version 1.1.463, 2009–2018 RStudio, Inc.).
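The Wilson interval for an observed proportion can be computed as in the following Python sketch. This is an illustration of the standard Wilson score formula rather than the original analysis code; the function name and the example counts are hypothetical.

```python
from math import sqrt


def wilson_interval(events, n, z=1.959964):
    """Wilson score interval for a proportion (default z gives a 95% interval)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical subgroup: 10 breast cancer events among 1000 women
lower, upper = wilson_interval(10, 1000)
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 and remains informative when a subgroup has few or no events.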
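The AUC can be read as the probability that a randomly chosen woman with breast cancer has a higher estimated incidence than a randomly chosen woman without, with ties counted as one half. The Python sketch below computes it this way by comparing all case-control pairs. It is illustrative only: it does not implement the DeLong confidence interval, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(risk_cases, risk_controls):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    if not risk_cases or not risk_controls:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case_risk in risk_cases:
        for control_risk in risk_controls:
            if case_risk > control_risk:
                wins += 1.0   # case correctly ranked above control
            elif case_risk == control_risk:
                wins += 0.5   # tie counts as half
    return wins / (len(risk_cases) * len(risk_controls))


# Hypothetical estimated incidences (not real data)
auc_example = auc([0.3, 0.6], [0.4, 0.5])
```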
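The NRI calculation described above can be sketched in Python as follows. The function and variable names are hypothetical, the example values are not real data, and the confidence interval of Pencina et al. is not implemented here.

```python
def net_reclassification_index(old_risk, new_risk, event, threshold=0.01):
    """Two-category NRI of a new risk estimate relative to an old one.

    Returns (overall NRI, NRI among events, NRI among non-events).
    Assumes at least one woman with and one without the event.
    """
    up_ev = down_ev = n_ev = 0
    up_non = down_non = n_non = 0
    for old, new, had_event in zip(old_risk, new_risk, event):
        moved_up = old < threshold <= new     # low risk -> high risk
        moved_down = new < threshold <= old   # high risk -> low risk
        if had_event:
            n_ev += 1
            up_ev += moved_up
            down_ev += moved_down
        else:
            n_non += 1
            up_non += moved_up
            down_non += moved_down
    nri_events = (up_ev - down_ev) / n_ev          # moving up is correct for events
    nri_nonevents = (down_non - up_non) / n_non    # moving down is correct for non-events
    return nri_events + nri_nonevents, nri_events, nri_nonevents


# Hypothetical example: two women with and two without breast cancer
nri_total, nri_ev, nri_non = net_reclassification_index(
    old_risk=[0.005, 0.020, 0.005, 0.020],
    new_risk=[0.020, 0.020, 0.005, 0.005],
    event=[True, True, False, False],
)
```

In this toy example, one of the two women with breast cancer moves up and one of the two women without moves down, so both components equal 0.5 and the overall NRI is 1.0.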