Beta-binomial model of allele frequency in next-generation sequencing (NGS)
Our method is designed for human applications and assumes a diploid genome. For each locus that contains a single nucleotide variant (SNV) called from NGS data, we define the allele frequency as the number of counts for the alternative (non-reference genome) allele over the total number of depth. For any diploid genome, if an individual is homozygous for the alternative allele (denoted as alternative/alternative, 1/1), the expected allele frequency is 1; if an individual is heterozygous (denoted as reference/alternative, 0/1) at a locus, 0.5 is the expected allele frequency. These theoretical expectations motivated our use of the binomial distribution for the number of reads at each locus,
where \(n\) is the total number of depth at the locus, \(p\) is the theoretical allele frequency, and \(x\) is the number of counts for the alternative allele.
While a simple model is intuitively appealing, previous studies have discovered extra binomial dispersion, specifically, overdispersion of allele frequency distributions [17–20]. This overdispersion results in higher variability than binomial distribution, so a distribution that models such large variance is needed. Previous studies have demonstrated the beta-binomial distribution as an appropriate model for allele frequencies at a particular locus in a subpopulation [21, 22]. The beta-binomial distribution is a discrete hierarchical model containing the beta distribution and binomial distribution, where the probability follows the beta distribution and the response follows the binomial distribution. Hence, the probability mass function of the beta-binomial distribution is
where \(n\) is the total number of reads at the locus; \(B (a , b)\) is the beta function theoretical allele frequency; and \(x\) is the number of counts for the alternative allele. This model has been applied in several studies, and the advantages of beta-binomial distribution compared to binomial distribution when dealing with overdispersion have been repeatedly demonstrated [21, 22]. Prior work using this model motivates our use of the beta-binomial distribution.
Quality control of variant call format (VCF) files
The input format for our method is the well-established VCF format [23]. To our knowledge, ours is the first method to detect same-species contamination using VCF. VCF files contain all the SNV information required in the subsequent steps, but quality control is needed to filter noise and unnecessary information. Because they use various algorithms, different variant calling tools generate different allele frequency patterns. It is strongly suggested that the same software is used for training and testing data to ensure that the features in models are consistent and the classification or regression results are accurate. The recommended quality control and processing steps are outlined below and are additional quality control steps beyond the processing conducted to produce the VCF files.
Step 1. Insertion/deletion (indel) filtering
SNVs (not CNVs such as indels) are used as substitution variants. Only substitution mutations result in heterozygous and homozygous genotypes that can be appropriately modeled by the beta-binomial distribution. Indels, identified as any mutation segments with a length of more than one base pair, are thus filtered/dropped in this step.
Step 2. Homozygous and heterozygous genotype calling
The genotypes for modeling are then called, generating new information that summarizes the genotype in reference to the alternative allele. For any SNV, there is a homozygous and heterozygous genotype for an alternative allele. Suggested genotypes are listed in the GT (genotype) field of the VCF file, where 0/0 is a homozygous reference, 0/1 is a heterozygous reference (“Het”), and 1/1 is a homozygous alternative (“Hom”). This results in two categories of called variants, each of which corresponds with its own beta-binomial model. Homozygous references (0/0) and heterozygous genotypes (1/2, 2/3, and so on) are labeled as “Complex” and are not included in further calculations.
Step 3. Low- and high-depth filtering
This step identifies whether a sequence is a true call or a sequencing error by setting thresholds for coverage depth. A reasonable read-depth threshold should be chosen according to the average read depths of a testing sample. Read depths >50 provide acceptable sensitivity and specificity for mutation detection [24].
Step 4. Change-point detection for CNVs
The features of a pure sample with a CNV region are similar to those of a region with more than one contributor (i.e., same-species contamination). Hence, the CNV region must be filtered before generating features. If CNVs have already been generated, the function vanquish::defcon() can directly filter the CNV region. Otherwise, a change-point detection method is used to detect the CNV region. Variances of B-allele frequency (alternative allele frequency) at heterozygous loci have been reported to differ among normal, duplication, deletion, and loss of heterozygosity (LOH) [25]. Therefore, change-point analysis can be employed to detect the change point of variance (i.e., the border of a copy number region). The change-point package is applied only for heterozygous positions to search for multiple change points of variance [26].
Distribution and likelihood-based features
The next step of our approach generates variables/features used in a model to predict same-species contamination in a sample. Two types of features are generated and used in model building: distribution-based features and likelihood-based features.
Distribution-based features are generated using allele frequency, which is a real number between 0 and 1. Allele frequency is categorized into four regions, as shown in Fig. 1: low alternative allele frequency (LowRate), heterozygous alternative allele frequency (HetRate), high alternative allele frequency (HighRate), and homozygous alternative allele frequency (HomRate). We used respective cut-off values of 0, 0.3, 0.7, and 0.99 for these regions. Fig. 2 shows the difference between pure and contaminated curves.
Table 1
Classification model features and their descriptions
Name
|
Description
|
LOH
|
Het/Hom, the ratio of heterozygous and homozygous markers within a sample
|
HomRate
|
The percentage of the loci in the HomRate region
|
HighRate
|
The percentage of the loci in the HighRate region
|
HetRate
|
The percentage of the loci in the HetRate region
|
LowRate
|
The percentage of the loci in the LowRate region
|
HomVar
|
The variance of allele frequencies in the HomRate region
|
HetVar
|
The variance of allele frequencies in the HetRate region
|
The model building steps generate eight distribution-based features, shown in Table 1. These features reflect the distribution of allele frequencies in an entire file, instead of at each variant calling position. Therefore, each input sample/VCF file is represented by one set of features.
The likelihood-based feature is the average likelihood of all loci in a VCF file, calculated by applying the beta-binomial distribution. We used NA10855 and other available pure samples as a reference genome to calculate the maximum likelihood estimator for parameters \(p\) and \(\rho\) in the beta-binomial distribution (sequenced at Q2 Solutions). The log-likelihood of all loci are calculated with \(pˆ\) and \(\rho ˆ\), generating their average value.
Support vector machine model
After generating features, a classification method determines whether a sample is from a single or multiple contributors. Utilizing the e1071 R package [27], we apply an SVM model because of the complexity in pattern recognition within the feature space [28]. The SVM method fits a hyperplane between single and multiple contributor regions for optimal classification determination. Since a linear model is not guaranteed, the Gaussian radial basis function (RBF) kernel is used to avoid parametric assumptions. As part of the SVM analysis, the cost and gamma parameters are tuned using the parallel searching method. A grid search is conducted on an exponentially growing sequence of cost and gamma parameters to find optimized paired values. The estimated parameters may differ depending on the training data set.
R package: Variant quality investigation helper
Our novel approach detects same-species or within-species contamination using B-allele frequency from only variant call information. The contamination detection procedure comprises the following steps, also outlined in Fig. 3:
Step 1: The VCF generated by a variant caller is read into R using the vanquish:: read_vcf function. The supported variant callers are GATK, VarDict, and strelka2.
Step 2: CNV regions in the VCF file are detected and filtered using the vanquish:: update_vcf function.
Step 3: Features for the radial kernel SVM model are extracted from each sample using the vanquish::generate_feature function.
Step 4: Parameter cost and gamma for kernel SVM are tuned.
Step 5: Contamination of a test sample is predicted.
The ability of our approach to determine contamination can be affected by two scenarios. First, normal-tumor samples comprising a mixture of tumor and normal cells from the same individual may be misclassified as contaminated. Second, for test samples of very low quality, it may be impossible to determine a clear B-allele frequency pattern, so they will not be considered contaminated.
Simulated data test results
To apply our method in real data, we used two reference samples from the 1000 Genomes project [29], NA12878 and NA10855, sequenced at Q2 Solutions. We obtained two pairs of FASTQ format files from sequencing results and resampled and mixed them to different proportions using seqtk [30], as shown in Table 2. For this simulated test, we treated NA12878 as the sample and assumed that NA10855 was mixed into the NA12878 sample at percentages ranging from 0.5–20%. We calculated the detection rate for various levels of contamination. There was a total of 50 million reads for the six mixture samples. Contamination percentages above 5% were readily detected while lower percentages were not (Table 2). Accordingly, the detection analysis has sensitivity above 5% contamination. For contaminants with less similarity to the sample with which they are mixed, the detection sensitivity will be lower; on the other hand, for contaminants with greater similarity, contamination detection will be more challenging.
Table 2
Contamination detection for a simulated data series (M: million).
Sample Component
|
Reads (NA12878)
|
Reads (NA10855)
|
Test Results
|
NA12878 (80%) + NA10855 (20%)
|
40M
|
10M
|
Contaminated
|
NA12878 (90%) + NA10855 (10%)
|
45M
|
5M
|
Contaminated
|
NA12878 (95%) + NA10855 (5%)
|
47.5M
|
2.5M
|
Contaminated
|
NA12878 (97.5%) + NA10855 (2.5%)
|
48.75M
|
1.25M
|
Pure
|
NA12878 (99%) + NA10855 (1%)
|
49.5M
|
0.5M
|
Pure
|
NA12878 (99.5%) + NA10855 (0.5%)
|
49.75M
|
0.25M
|
Pure
|
Real-data test results
After quantitative simulation testing, we applied the trained model in a set of real data comprising 22 samples. Table 3 displays the range of cell types and samples used, and the results. The samples are ranked by regression values from e1071::svm(). While predictions for 20 of the 22 samples were correct according to prior identification, two human-T-lymphoblast samples (see bold text in Table 3) were predicted as pure but were contaminated. In response, we checked the B-allele frequency distribution for these two samples (Fig. 4). The middle area of the CNV pattern was shifted lower from 0.5 to 0.3, indicating the samples were tumor-normal cells from the same individual. The distance of the shift in the CNV pattern reflects the percentage of tumor and normal cells in a sample.
We tested the model with a second data set comprising 53 samples. Twelve samples were purposely mixed with a contaminant, and 41 samples were pure. The test results showed sensitivity > 99.99% and specificity of 90.24%. Four false-positive samples were detected by our method. These false positives were all in formalin-fixed paraffin-embedded (FFPE) tissue samples that were likely degraded (Fig. 5). The false positives may be because the features generated from a degraded sample are similar to those from a contaminated sample.
Table 3
Contamination detection for a real-data series. Predictions for 20 of the 22 samples were correct according to prior identification. Two human T- lymphoblast samples (bold text) were predicted as pure but were contaminated.
Sample Name
|
Classification
|
Regression
|
Prior Identification
|
Human B-Lymphocyte L8
|
1
|
1.9243094
|
1
|
Human B-Lymphocyte 2 L20
|
1
|
1.9209875
|
1
|
Human Breast 2 L16
|
0
|
1.483925
|
0
|
Human Breast L4
|
0
|
1.463376
|
0
|
Human T-Lymphoblast 2 L21∗
|
0
|
1.3622305
|
1
|
Human T-Lymphoblast L9
|
0
|
1.3472358
|
1
|
Human Brain L3
|
0
|
1.3147938
|
0
|
Human Brain 2 L15
|
0
|
1.303287
|
0
|
Human Testis L12
|
0
|
1.245767
|
0
|
Human Cervix 2 L17
|
0
|
1.2429423
|
0
|
Human Testis 2 L24
|
0
|
1.2424441
|
0
|
Human Cervix L5
|
0
|
1.203943
|
0
|
Human Macrophage L10
|
0
|
1.158416
|
0
|
Human Macrophage 2 L22
|
0
|
1.1582528
|
0
|
Human Liver 2 L18
|
0
|
1.1442246
|
0
|
Human Liposarcoma L7
|
0
|
1.1406007
|
0
|
Human Liposarcoma 2 L19
|
0
|
1.132044
|
0
|
Human Skin 2 L23
|
0
|
1.1209464
|
0
|
Human Skin L11
|
0
|
1.1194772
|
0
|
Human Liver L6
|
0
|
1.1170909
|
0
|
Human Reference DNA Male L1
|
0
|
1.0945151
|
0
|
Human Reference DNA Male 2 L13
|
0
|
1.0906951
|
0
|
∗ This is a mixture of tumor and normal cells. See Fig. 2 for the B-allele frequency distribution of this sample.