To test how well chromosomal length variation can predict breast cancer, we acquired germline genetic data on breast cancer patients and on patients without breast cancer (as a control group) from two data sources: the Cancer Genome Atlas (TCGA)[9, 12, 13] and the UK Biobank[14] project.
The Cancer Genome Atlas (TCGA) characterized molecular differences in 33 different human cancers (8, 9). The project collected multiple samples from each of about 11,000 patients, including tumor tissue, normal tissue adjacent to the tumor, and normal blood.
Each patient’s germline DNA was extracted from the normal blood samples. A single laboratory processed all germline DNA samples. Each patient’s germline DNA was genotyped at single nucleotide polymorphisms (SNPs) using an Affymetrix SNP 6.0 array. This SNP data was then processed (by the TCGA project) through a bioinformatics pipeline (10), which included the packages Birdsuite (11) and DNAcopy (12). For each patient, the pipeline produced a list of chromosomal regions (each characterized by the chromosome number, a starting location, and an ending location) and an associated “segmented mean value” for each region. The segmented mean value is defined as the logarithm, base 2, of one-half the copy number. A normal diploid region with two copies therefore has a segmented mean value of zero.
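As an illustration only (this is not code from the TCGA pipeline, and the function names are hypothetical), the definition of the segmented mean value can be written as a short Python sketch that converts between copy number and segmented mean value:

```python
import math


def segment_mean(copy_number: float) -> float:
    """Segmented mean value: log base 2 of one-half the copy number."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive for the logarithm to be defined")
    return math.log2(copy_number / 2)


def copy_number_from_segment_mean(value: float) -> float:
    """Invert the definition: copy number = 2 * 2**value."""
    return 2 * 2 ** value


# One copy (a deletion), two copies (normal diploid), three copies (a gain)
for cn in (1, 2, 3):
    print(cn, round(segment_mean(cn), 3))
```

Under this definition, a single-copy region gives a negative value, a diploid region gives zero, and a three-copy region gives a positive value.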
The Genomic Data Commons contains most of the TCGA data (13). In the Genomic Data Commons, the copy number variation data is called the masked copy number variation. The masking process removes “Y chromosome and probe sets that were previously indicated to have frequent germline copy-number variation.” (10)
We refer to the final TCGA dataset we used as the masked copy number variation dataset. This dataset originates from normal blood samples extracted from 8,826 different patients: 4,692 females and 4,134 males. The patients’ ages ranged from 10 to 90 years old.
This dataset contains about 695,000 different copy number variations that appear in at least one patient. Copy number variations are genomic regions characterized by the chromosome number, a start position, an end position, and a copy number value. The copy number value is represented as the log base 2 of the ratio of the number of copies present to the expected number of copies, two. A value of 0, that is log2(1), represents the expected number of copies; a negative number indicates a deletion, and a positive number indicates a gain of copies. In the TCGA dataset, normal regions (those with a log2 copy number value equal to 0) are not recorded.
While most copy number variations output by this TCGA pipeline are relatively short, less than the size of a gene, we noted that a few are relatively long, spanning most of a chromosome’s length. For instance, the copy number variation output by the TCGA pipeline that we use to characterize chromosome 1 is 244 megabases long, while the full length of chromosome 1 is 249 megabases. Characterizing each chromosome in this way produced a dataset with 8,826 rows (each representing a different patient) and 23 columns (each representing one of the chromosomes 1–22 and the X chromosome).
We created a case/control study to differentiate between people with breast cancer and those without breast cancer. As cases, we included all female patients in the TCGA dataset who had a breast cancer diagnosis and a “normal blood” sample, and therefore had copy number variation measurements from DNA derived from normal blood in the database. No cases were excluded.
Controls for the TCGA dataset include all women in the dataset who had “normal blood” samples without a breast cancer diagnosis. We included only women (no men) in the control sample. Due to the nature of TCGA, each woman in the “control” sample had another type of cancer diagnosis (not breast cancer).
To ensure that the results from the TCGA dataset were not due to systematic effects in producing the data, we tested the same methods in a second independent dataset, the UK Biobank. Because of differences in the way in which the data was made available, we could not directly test the predictive power of a model developed on TCGA data with UK Biobank data and vice versa.
We obtained data from the UK Biobank under Application Number 47850. The UK Biobank project collected genetic data and medical records from about 500,000 people who were between the ages of 40 and 69 during the 2006–2010 recruitment years. These people volunteered for the study and are healthier than the general population (16). Most have supplied biological samples and filled out questionnaires about their health. Most of the participants’ medical records are linked, through the National Health Service, to the UK Biobank records. This linkage provides for ongoing follow-up of health conditions [15, 16].
As previously described [10], we first downloaded the “l2r” files from the UK Biobank (version 1). Each chromosome has a separate “l2r” file. Each “l2r” file contained 488,377 columns and a variable number of rows. Each column represented a unique patient in the dataset, who can be identified with an encoded ID number. Each row represented a different location in the genome. The values in the file represent the log base 2 ratio of intensity relative to the expected two copies measured at the SNP location.
We next computed the mean l2r value for different portions of each chromosome for each patient in the dataset, which we refer to as the “length”. We computed the length of each portion from the l2r files by taking the average of all l2r values measured across that portion of the person’s chromosome. A value of 0 represents the nominal average length of that portion of the chromosome. We call this dataset the chromosome-scale length variation (CSLV) dataset.
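The averaging step can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: the function name is hypothetical, and dividing a chromosome into portions by marker count (rather than by base-pair position) is an assumption made here for simplicity.

```python
from statistics import fmean


def portion_means(l2r_values, n_portions=4):
    """Mean l2r within each of n_portions contiguous blocks of markers.

    l2r_values: l2r measurements for one person on one chromosome,
    ordered by genomic position. Splitting by marker count is an
    assumption of this sketch, not a detail taken from the text.
    """
    n = len(l2r_values)
    if n < n_portions:
        raise ValueError("need at least one marker per portion")
    # Integer boundaries that divide the markers into contiguous blocks
    bounds = [i * n // n_portions for i in range(n_portions + 1)]
    return [fmean(l2r_values[bounds[i]:bounds[i + 1]]) for i in range(n_portions)]


# Toy values for a chromosome with eight markers (not real data)
example = [0.1, -0.1, 0.0, 0.2, -0.2, 0.0, 0.3, 0.1]
print(portion_means(example))
```

Applying such a function to every chromosome of every participant would yield one row of “length” values per person.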
This CSLV dataset was joined with the UK Biobank health records dataset. UK Biobank matched each person in the Public Health England data with UK Biobank’s internal records to produce the person’s encoded participant ID. The dataset we have, provided by UK Biobank, contains the participant ID and whether the participant had been diagnosed with breast cancer.
The UK Biobank dataset that we used consisted of measurements at 820,967 genetic markers across 23 chromosomes for each of 488,377 different patients. From the UK Biobank population, we constructed a dataset of positive cases that included all women who both self-reported having been diagnosed with breast cancer and were identified by cancer registries as having been diagnosed with breast cancer, a total of 1,534 women. We then constructed a control dataset from a pool of 10,000 UK Biobank participants. From this pool of 10,000, we excluded all men and any women who had any type of cancer diagnosis, either self-reported or from a cancer registry. This gave a control group of 4,391 cancer-free women.
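The selection rules above can be expressed as two simple filters. The Python sketch below is illustrative only: the record layout and field names are hypothetical and do not correspond to actual UK Biobank field identifiers.

```python
def select_cases(participants):
    """Women with breast cancer both self-reported and in a cancer registry."""
    return [
        p["id"] for p in participants
        if p["sex"] == "F"
        and p["self_reported_breast_cancer"]
        and p["registry_breast_cancer"]
    ]


def select_controls(pool):
    """Women in the pool with no cancer diagnosis from either source."""
    return [
        p["id"] for p in pool
        if p["sex"] == "F"
        and not p["self_reported_any_cancer"]
        and not p["registry_any_cancer"]
    ]


def make_record(pid, sex, self_bc, reg_bc, self_any, reg_any):
    """Build one toy participant record (hypothetical field names)."""
    return {
        "id": pid, "sex": sex,
        "self_reported_breast_cancer": self_bc, "registry_breast_cancer": reg_bc,
        "self_reported_any_cancer": self_any, "registry_any_cancer": reg_any,
    }


toy = [
    make_record(1, "F", True, True, True, True),      # case: both sources agree
    make_record(2, "F", True, False, True, False),    # self-report only: neither case nor control
    make_record(3, "F", False, False, False, False),  # control: cancer-free woman
    make_record(4, "M", False, False, False, False),  # excluded: male
]
print(select_cases(toy), select_controls(toy))
```

Requiring agreement between self-report and registry for cases, while excluding any cancer from either source for controls, keeps ambiguous records out of both groups.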
We quantified the germline DNA of each of these women with 88 numbers, each representing the length variation of one quarter of one of the 22 autosomes (four quarters for each of 22 chromosomes). We did not use the X chromosome.
For both the TCGA data and the UK Biobank data, we set up the problem as a binary classification supervised learning task: distinguishing patients diagnosed with breast cancer from those not diagnosed with breast cancer.
We performed the analysis using the statistical language R. The data was reformatted for analysis and then fed into the machine-learning algorithm. The data included whether the subject had breast cancer and measurements of copy number variation at distinct locations across the genome derived from the patient’s peripheral blood sample. The data fed into the machine-learning algorithm did not include age, since germline DNA should not depend upon age. The results are therefore independent of the patient’s age.
We used the H2O package in R for machine learning. This package implements several common machine learning algorithms, including gradient boosting machines, deep learning neural networks, distributed random forests, and generalized linear models. The best performing algorithms for our datasets are invariably stacked ensembles, which are combinations of other machine learning algorithms. This combination often provides superior results to any particular algorithm[17, 18]. The H2O package implements stacked ensembles as super learner algorithms[19].
We used the H2O Automatic Machine Learning (automl) function to identify a good machine learning model. The automl function is given a specific amount of time and then proceeds to train and tune a series of models, searching for the best model. To obtain confidence intervals, we repeated the training multiple times (at least 10) with different initial randomization. This process provides independent test/train splits of the data.
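The idea of repeating the analysis with different randomization can be sketched as below. This is an illustration in Python, not the R/H2O code used in the study: the function name, the 20% test fraction, and the use of the seed to drive the split are all assumptions of the sketch, and the model training step itself is omitted.

```python
import random


def train_test_split(ids, seed, test_fraction=0.2):
    """Shuffle sample IDs with a seeded generator and split into train/test.

    The 20% test fraction is a hypothetical choice for this sketch.
    """
    rng = random.Random(seed)   # seeded, so each repeat is reproducible
    shuffled = list(ids)        # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


ids = list(range(100))  # toy sample IDs
# Ten repeats, each with its own seed and therefore its own split
splits = [train_test_split(ids, seed) for seed in range(10)]
for train, test in splits:
    assert not set(train) & set(test)  # no sample appears in both sets
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each repeat would then train a model on its own training set and score it on its own test set, giving one performance value per repeat.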
Statistical tests were performed in R. We computed the 95% confidence intervals using the R command t.test. Normality was first confirmed with the Shapiro-Wilk test.
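For reference, a one-sample t-based 95% confidence interval of the kind reported by t.test can be computed by hand as in the Python sketch below. The scores are hypothetical placeholders, not results from this study; the critical value 2.262 is the two-sided 95% value of Student's t distribution with 9 degrees of freedom, which applies to exactly 10 repeats.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
T_CRIT_DF9 = 2.262


def t_confidence_interval(values, t_crit):
    """Confidence interval for the mean: mean +/- t_crit * s / sqrt(n)."""
    n = len(values)
    m = mean(values)
    half_width = t_crit * stdev(values) / math.sqrt(n)  # stdev is the sample SD
    return m - half_width, m + half_width


# Hypothetical performance scores from 10 repeated runs (not real results)
scores = [0.61, 0.63, 0.60, 0.64, 0.62, 0.59, 0.65, 0.62, 0.61, 0.63]
low, high = t_confidence_interval(scores, T_CRIT_DF9)
print(low, high)
```

With a different number of repeats, the critical value would need to be replaced by the one for the corresponding degrees of freedom.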