Same-species Contamination Detection With Variant Calling Information From Next-generation Sequencing

doi:10.21203/rs.3.rs-858518/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background: Same-species contamination detection is an important quality control step in genetic data analysis. Due to a scarcity of methods to detect and correct for this quality control issue, same-species contamination is more difficult to detect than cross-species contamination. We introduce a novel machine learning algorithm to detect same-species contamination in next-generation sequencing (NGS) data using a support vector machine (SVM) model. Our approach uniquely detects contamination using variant calling information stored in variant call format (VCF) files for DNA or RNA. Importantly, it can differentiate between same-species contamination and mixtures of tumor and normal cells.

Methods: In the first stage, a change-point detection method is used to identify copy number variations (CNVs) and copy number aberrations (CNAs) for filtering. Next, single nucleotide polymorphism (SNP) data are used to test for same-species contamination using an SVM model. Based on the assumption that alternative allele frequencies in NGS follow the beta-binomial distribution, the deviation parameter ρ is estimated by the maximum likelihood method. All features of a radial basis function (RBF) kernel SVM are generated using publicly available or private training data.

Results: We demonstrate our approach in simulation experiments. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generate VCF files using variants identified in these data and then evaluate the power and false-positive rate of our approach. Our approach can detect contamination levels as low as 5% with a reasonable false-positive rate. Results in real data have sensitivity above 99.99% and specificity of 90.24%, even in the presence of degraded samples with similar features as contaminated samples. We provide an R software implementation of our approach.

Conclusions: Our approach addresses the gap in methods to test for same-species contamination in NGS. Due to its high sensitivity for degraded samples and tumor-normal samples, it represents an important tool that can be applied within the quality control process. Additionally, the user-friendly software has the unique ability to conduct quality control using the VCF format.

Bioinformatics

Same-species contamination

next-generation sequencing

support vector machine

beta-binomial distribution

High-throughput next-generation sequencing (NGS) has advantages over traditional Sanger sequencing and microarrays in terms of accuracy, cost, and speed [1, 2]. As NGS technologies have matured, best practices for quality control and data processing procedures have also been developed [3]. Detecting sample contamination is a necessary quality control step for the NGS data analysis pipeline since contamination can occur during sample preparation and sequencing analysis. Sample contamination affects downstream sample analysis and may even generate misleading results, leading to false-positive associations and genotype misclassification [4].

Contamination occurs when a sample contains tissues from more than one source and can emerge in NGS samples for various reasons. Despite best practices, the use of unclean lab devices can introduce unexpected materials such as mycoplasma [5]. This occurs in projects of all scales, including the recent large-scale 1000 Genomes Project [6]. Contamination can also arise from sample handling, sample extraction, library preparation and amplification, sample multiplexing, and inaccurate barcode sequencing [7]. Existing contamination detection methods are mainly based on sequencing and allele frequency information for samples and can be categorized into two groups based on the source of contamination: cross-species contamination and same-species contamination.

Cross-species contamination has been well-studied, and modern metagenomics approaches are extensions of cross-species contamination detection approaches. There are several methods for detecting cross-species contamination [8–11]. For example, Schmieder and Edwards [10] developed DeconSeq, a framework for identifying and removing human contamination from microbial metagenomes during sequencing alignment. Merchant et al. [2] scanned samples from Bos taurus, the domestic cow, using microbiome analysis software and found small contigs from microbial contaminants. In these approaches, data are generally assembled from available Sanger reads for known species, and then the unmapped contigs within the assembly are classified by k-mer matching to a database containing all bacteria, archaea, and viruses from the RefSeq database. The presence of contigs aligning with other genomes is a sign of contamination.

In contrast, detecting same-species or within-species contamination is more challenging, and there are few valid, robust approaches. The most commonly implemented approach and the earliest developed is ContEst [12], a module in the Genome Analysis ToolKit (GATK) software [13]. ContEst uses a Bayesian method to calculate the posterior probability of a specific contamination level and find the maximum a posteriori probability (MAP) estimate of the contamination level at homozygous loci. Assuming a uniform prior distribution, Unif(0, 1), on the contamination level, the posterior distribution of the contamination level is proportional to the joint distribution of observed alleles, given the base calling qualities and the probability of observing true alleles in a contaminated sample. Thus, ContEst requires variant call format (VCF) and binary alignment map (BAM) input and general population frequency information such as base identities and quality scores from sequencing data.

In addition, the VerifyBamID package detects same-species contamination of human DNA samples in both sequence- and array-based data [4]. VerifyBamID implements likelihood- and regression-based approaches that assume a tested DNA sample contains no more than one contaminant. The probability of a sample having a particular contamination level is maximized through a grid search over each contamination level. While VerifyBamID has demonstrated good sensitivity in real-data experiments, copy number alterations (CNAs) in tumor samples shift allele frequencies away from those outside CNA regions, resulting in the misinterpretation of copy number-driven shift as contamination [14]. To address this, the Conpair method builds on the statistical model introduced in VerifyBamID and focuses on homozygous loci to detect additional sources of same-species contamination in samples containing a mixture of tumor and normal cells from the same individual [14]. Given that homozygous markers are invariant to copy number changes, Conpair uses pre-selected, highly informative genomic homozygous markers to perform contamination detection.

More recently developed methods use haplotype structure for contamination detection in NGS data [15]. In one approach, closely spaced single nucleotide polymorphism (SNP) pairs within a sequencing region are identified from the 1000 Genomes database [16], and read haplotypes are inferred for the selected SNP pairs. A human-human admixture is suggested if more than two read haplotypes are observed at a given locus in a sample. The estimated level of contamination for each sample is twice the mean frequency of the minor haplotype.

Current approaches for same-species contamination detection have been successful in a broad range of applications, but there are major limitations. We address these limitations in our approach, which provides substantial improvements in both the practical implementation of quality control procedures and the statistical model used. While existing approaches rely on sizeable human reference genome data as well as at least two large, memory-intensive files, either tumor and normal BAM files (Conpair), or VCF files and BAM files (VerifyBamID and ContEst), our method directly uses information in VCF files through a combination of beta-binomial assumption and support vector machines (SVMs) to detect same-species contamination. Even for tumor-normal paired samples, which are common for individuals with cancer, no additional information is required. The change points of B-allele frequencies (from the VCF file) are detected and then all chromosomes are separated into shorter sequences. Sequences overlapping any copy number variations (CNVs) or aberration regions are detected and filtered. We applied this method in both real and simulated data and found that it has excellent sensitivity and specificity for both types of data. To assist in the application to real data, we developed an R package implementation of the method.

Beta-binomial model of allele frequency in next-generation sequencing (NGS)

Our method is designed for human applications and assumes a diploid genome. For each locus that contains a single nucleotide variant (SNV) called from NGS data, we define the allele frequency as the number of reads supporting the alternative (non-reference genome) allele divided by the total read depth. For any diploid genome, if an individual is homozygous for the alternative allele (denoted as alternative/alternative, 1/1), the expected allele frequency is 1; if an individual is heterozygous (denoted as reference/alternative, 0/1) at a locus, the expected allele frequency is 0.5. These theoretical expectations motivated our use of the binomial distribution for the number of alternative-allele reads at each locus,

\[P(X = x \mid n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x},\]

where \(n\) is the total read depth at the locus, \(p\) is the theoretical allele frequency, and \(x\) is the number of counts for the alternative allele.

While a simple model is intuitively appealing, previous studies have discovered extra-binomial dispersion, specifically, overdispersion of allele frequency distributions [17–20]. This overdispersion results in higher variability than the binomial distribution predicts, so a distribution that models such large variance is needed. Previous studies have demonstrated the beta-binomial distribution as an appropriate model for allele frequencies at a particular locus in a subpopulation [21, 22]. The beta-binomial distribution is a discrete hierarchical model containing the beta distribution and binomial distribution, where the probability follows the beta distribution and the response follows the binomial distribution. Hence, the probability mass function of the beta-binomial distribution is

\[P(X = x \mid n, \alpha, \beta) = \binom{n}{x} \frac{B(x + \alpha,\, n - x + \beta)}{B(\alpha, \beta)},\]

where \(n\) is the total number of reads at the locus; \(B(a, b)\) is the beta function; \(\alpha\) and \(\beta\) are the shape parameters of the beta distribution, whose mean \(\alpha / (\alpha + \beta)\) is the theoretical allele frequency; and \(x\) is the number of counts for the alternative allele. This model has been applied in several studies, and the advantages of the beta-binomial distribution compared to the binomial distribution when dealing with overdispersion have been repeatedly demonstrated [21, 22]. Prior work using this model motivates our use of the beta-binomial distribution.
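To make the effect of overdispersion concrete, the following Python sketch evaluates both probability mass functions and compares their variances at one locus. It is an illustration only, not the vanquish implementation: the mapping from the mean \(p\) and the dispersion parameter \(\rho\) to the beta shape parameters, \(\alpha = p(1-\rho)/\rho\) and \(\beta = (1-p)(1-\rho)/\rho\), is a common reparameterization that is assumed here, and the depth of 100 is an arbitrary example value.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_choose(n: int, x: int) -> float:
    """Natural log of the binomial coefficient C(n, x)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def binom_logpmf(x: int, n: int, p: float) -> float:
    """Log PMF of the binomial distribution."""
    return log_choose(n, x) + x * math.log(p) + (n - x) * math.log(1 - p)

def betabinom_logpmf(x: int, n: int, p: float, rho: float) -> float:
    """Log PMF of the beta-binomial distribution with mean p and dispersion rho."""
    # Assumed mapping from (p, rho) to the beta shape parameters.
    alpha = p * (1 - rho) / rho
    beta = (1 - p) * (1 - rho) / rho
    return log_choose(n, x) + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)

def pmf_variance(logpmf, n: int) -> float:
    """Variance of a distribution on 0..n given its log PMF."""
    probs = [math.exp(logpmf(x)) for x in range(n + 1)]
    mean = sum(x * q for x, q in enumerate(probs))
    return sum((x - mean) ** 2 * q for x, q in enumerate(probs))

# Illustrative values: depth 100, heterozygous locus, example dispersion.
n, p, rho = 100, 0.5, 0.182
var_binom = pmf_variance(lambda x: binom_logpmf(x, n, p), n)
var_betabinom = pmf_variance(lambda x: betabinom_logpmf(x, n, p, rho), n)
```

Under this parameterization the beta-binomial variance is the binomial variance \(np(1-p)\) inflated by the factor \(1 + (n-1)\rho\), which is why even a modest \(\rho\) widens the expected band of allele frequencies considerably.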

Quality control of variant call format (VCF) files

The input format for our method is the well-established VCF format [23]. To our knowledge, ours is the first method to detect same-species contamination using VCF. VCF files contain all the SNV information required in the subsequent steps, but quality control is needed to filter noise and unnecessary information. Because they use various algorithms, different variant calling tools generate different allele frequency patterns. It is strongly suggested that the same software is used for training and testing data to ensure that the features in models are consistent and the classification or regression results are accurate. The recommended quality control and processing steps are outlined below and are additional quality control steps beyond the processing conducted to produce the VCF files.

Step 1. Insertion/deletion (indel) filtering

Only SNVs (single-base substitutions) are used; indels are excluded. Only substitution mutations result in heterozygous and homozygous genotypes that can be appropriately modeled by the beta-binomial distribution. Indels, identified as any mutation segment longer than one base pair, are thus dropped in this step.

Step 2. Homozygous and heterozygous genotype calling

The genotypes for modeling are then called, generating new information that summarizes the genotype in reference to the alternative allele. Each retained SNV is either heterozygous or homozygous for the alternative allele. Suggested genotypes are listed in the GT (genotype) field of the VCF file, where 0/0 is homozygous reference, 0/1 is heterozygous (“Het”), and 1/1 is homozygous alternative (“Hom”). This results in two categories of called variants, Het and Hom, each of which corresponds to its own beta-binomial model. Homozygous reference calls (0/0) and multi-allelic heterozygous genotypes (1/2, 2/3, and so on) are labeled as “Complex” and are not included in further calculations.

Step 3. Low- and high-depth filtering

This step identifies whether a sequence is a true call or a sequencing error by setting thresholds for coverage depth. A reasonable read-depth threshold should be chosen according to the average read depths of a testing sample. Read depths >50 provide acceptable sensitivity and specificity for mutation detection [24].
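Steps 1 to 3 can be sketched as a single per-record filter. The Python function below is a hypothetical illustration and not part of the vanquish package: the name classify_variant, the label strings other than Het, Hom, and Complex, and the upper depth limit of 1000 are assumptions, and the lower depth limit of 50 simply mirrors the read-depth figure quoted above. Symbolic ALT alleles such as `*` are not handled.

```python
def classify_variant(ref: str, alt: str, gt: str, depth: int,
                     min_depth: int = 50, max_depth: int = 1000) -> str:
    """Label one VCF record using its REF, ALT, GT, and total depth."""
    # Step 1: anything other than a single-base substitution is treated as an indel.
    alt_alleles = alt.split(",")
    if len(ref) != 1 or any(len(a) != 1 for a in alt_alleles):
        return "Indel"
    # Step 3: depth thresholds (both limits are illustrative assumptions).
    if depth < min_depth:
        return "LowDepth"
    if depth > max_depth:
        return "HighDepth"
    # Step 2: genotype label from the GT field (phased or unphased).
    calls = sorted(gt.replace("|", "/").split("/"))
    if calls == ["1", "1"]:
        return "Hom"
    if calls == ["0", "1"]:
        return "Het"
    # 0/0 and multi-allelic genotypes such as 1/2 are excluded from modeling.
    return "Complex"

example_labels = [
    classify_variant("A", "G", "0/1", 80),
    classify_variant("A", "G", "1/1", 80),
    classify_variant("AT", "A", "0/1", 80),
    classify_variant("A", "G,T", "1/2", 80),
]
```

Only records labeled Het or Hom would be passed on to the feature generation described later.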

Step 4. Change-point detection for CNVs

The features of a pure sample with a CNV region are similar to those of a region with more than one contributor (i.e., same-species contamination). Hence, the CNV region must be filtered before generating features. If CNVs have already been generated, the function vanquish::defcon() can directly filter the CNV region. Otherwise, a change-point detection method is used to detect the CNV region. Variances of B-allele frequency (alternative allele frequency) at heterozygous loci have been reported to differ among normal, duplication, deletion, and loss of heterozygosity (LOH) [25]. Therefore, change-point analysis can be employed to detect the change point of variance (i.e., the border of a copy number region). The changepoint R package [26] is applied only to heterozygous positions to search for multiple change points in variance.
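The idea of a change point in variance can be illustrated with a much simpler procedure than the one used in practice. The Python sketch below searches exhaustively for a single split that maximizes a Gaussian variance-change likelihood; it is not the multiple change-point search of the changepoint package, and the function name, the minimum segment length, and the synthetic B-allele frequencies are all assumptions made for the example.

```python
import math
import random

def single_variance_changepoint(y, min_len=5):
    """Return (tau, gain): the best single split index and its likelihood gain."""
    n = len(y)
    mu = sum(y) / n  # common mean; only the variance is allowed to change
    prefix = [0.0]
    for v in y:
        prefix.append(prefix[-1] + (v - mu) ** 2)

    def cost(start, end):
        # Twice the negative Gaussian log-likelihood of y[start:end], up to a constant.
        m = end - start
        var = max((prefix[end] - prefix[start]) / m, 1e-12)
        return m * math.log(var)

    no_change = cost(0, n)
    best_tau, best_cost = None, no_change
    for tau in range(min_len, n - min_len + 1):
        c = cost(0, tau) + cost(tau, n)
        if c < best_cost:
            best_tau, best_cost = tau, c
    return best_tau, no_change - best_cost

# Synthetic heterozygous BAFs: a narrow band followed by a noisier region.
rng = random.Random(0)
baf = ([rng.gauss(0.5, 0.03) for _ in range(200)]
       + [rng.gauss(0.5, 0.15) for _ in range(200)])
tau, gain = single_variance_changepoint(baf)
```

In a real analysis the gain would be compared against a penalty before accepting the split, and the search would be repeated to find several change points, which is what penalized methods automate.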

Distribution and likelihood-based features

The next step of our approach generates variables/features used in a model to predict same-species contamination in a sample. Two types of features are generated and used in model building: distribution-based features and likelihood-based features.

Distribution-based features are generated using allele frequency, which is a real number between 0 and 1. Allele frequency is categorized into four regions, as shown in Fig. 1: low alternative allele frequency (LowRate), heterozygous alternative allele frequency (HetRate), high alternative allele frequency (HighRate), and homozygous alternative allele frequency (HomRate). We used respective cut-off values of 0, 0.3, 0.7, and 0.99 for these regions. Fig. 2 shows the difference between pure and contaminated curves.

Table 1

Classification model features and their descriptions
Name	Description
LOH	Het/Hom, the ratio of heterozygous and homozygous markers within a sample
HomRate	The percentage of the loci in the HomRate region
HighRate	The percentage of the loci in the HighRate region
HetRate	The percentage of the loci in the HetRate region
LowRate	The percentage of the loci in the LowRate region
HomVar	The variance of allele frequencies in the HomRate region
HetVar	The variance of allele frequencies in the HetRate region

The model building steps generate eight features in total: the seven distribution-based features shown in Table 1 and the likelihood-based feature described below. These features reflect the distribution of allele frequencies in an entire file, instead of at each variant calling position. Therefore, each input sample/VCF file is represented by one set of features.

The likelihood-based feature is the average log-likelihood of all loci in a VCF file, calculated by applying the beta-binomial distribution. We used NA10855 and other available pure samples (sequenced at Q2 Solutions) as references to calculate the maximum likelihood estimators of the parameters \(p\) and \(\rho\) in the beta-binomial distribution. The log-likelihoods of all loci are calculated with \(\hat{p}\) and \(\hat{\rho}\) and then averaged.

Support vector machine model

After generating features, a classification method determines whether a sample is from a single or multiple contributors. Utilizing the e1071 R package [27], we apply an SVM model because of the complexity in pattern recognition within the feature space [28]. The SVM method fits a hyperplane between single and multiple contributor regions for optimal classification determination. Since the two classes are not guaranteed to be linearly separable, the Gaussian radial basis function (RBF) kernel is used to avoid parametric assumptions. As part of the SVM analysis, the cost and gamma parameters are tuned using the parallel searching method. A grid search is conducted on an exponentially growing sequence of cost and gamma parameters to find optimized paired values. The estimated parameters may differ depending on the training data set.
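For readers unfamiliar with the RBF kernel, the short Python sketch below shows how the similarity between two feature vectors is computed, using the form \(K(u, v) = \exp(-\gamma \lVert u - v \rVert^2)\) documented for the radial kernel in e1071. The two three-element feature vectors are made-up values for illustration and do not come from any sample in this study.

```python
import math

def rbf_kernel(u, v, gamma: float) -> float:
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Hypothetical feature vectors (for example HomRate, HetRate, HighRate).
sample_a = [0.40, 0.45, 0.05]
sample_b = [0.30, 0.35, 0.20]

k_same = rbf_kernel(sample_a, sample_a, gamma=0.25)
k_diff = rbf_kernel(sample_a, sample_b, gamma=0.25)
```

A larger gamma makes the kernel decay faster with distance, so the decision boundary follows the training samples more closely; this is why gamma is tuned together with the cost parameter.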

R package: Variant quality investigation helper

Our novel approach detects same-species or within-species contamination using B-allele frequency from only variant call information. The contamination detection procedure comprises the following steps, also outlined in Fig. 3:

Step 1: The VCF generated by a variant caller is read into R using the vanquish::read_vcf function. The supported variant callers are GATK, VarDict, and Strelka2.

Step 2: CNV regions in the VCF file are detected and filtered using the vanquish::update_vcf function.

Step 3: Features for the radial kernel SVM model are extracted from each sample using the vanquish::generate_feature function.

Step 4: The cost and gamma parameters of the kernel SVM are tuned.

Step 5: Contamination of a test sample is predicted.

The ability of our approach to determine contamination can be affected by two scenarios. First, normal-tumor samples comprising a mixture of tumor and normal cells from the same individual may be misclassified as contaminated. Second, for test samples of very low quality, it may be impossible to determine a clear B-allele frequency pattern, so they will not be considered contaminated.

Simulated data test results

To test our method on simulated data, we used two reference samples from the 1000 Genomes Project [29], NA12878 and NA10855, sequenced at Q2 Solutions. We obtained two pairs of FASTQ format files from the sequencing results, then resampled and mixed them in different proportions using seqtk [30], as shown in Table 2. For this simulated test, we treated NA12878 as the sample and assumed that NA10855 was mixed into the NA12878 sample at percentages ranging from 0.5% to 20%. We calculated the detection rate for various levels of contamination. Each of the six mixture samples contained a total of 50 million reads. Contamination percentages of 5% and above were detected, while lower percentages were not (Table 2). Accordingly, the method is sensitive to contamination levels of 5% and above. For contaminants with less similarity to the sample with which they are mixed, contamination will be easier to detect; on the other hand, for contaminants with greater similarity, contamination detection will be more challenging.

Table 2

Contamination detection for a simulated data series (M: million).
Sample Component	Reads (NA12878)	Reads (NA10855)	Test Results
NA12878 (80%) + NA10855 (20%)	40M	10M	Contaminated
NA12878 (90%) + NA10855 (10%)	45M	5M	Contaminated
NA12878 (95%) + NA10855 (5%)	47.5M	2.5M	Contaminated
NA12878 (97.5%) + NA10855 (2.5%)	48.75M	1.25M	Pure
NA12878 (99%) + NA10855 (1%)	49.5M	0.5M	Pure
NA12878 (99.5%) + NA10855 (0.5%)	49.75M	0.25M	Pure
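The read counts in Table 2 follow directly from the mixing proportions and the fixed total of 50 million reads. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and the subsampling itself (performed with seqtk in this study) is not shown.

```python
TOTAL_READS = 50_000_000
CONTAMINANT_FRACTIONS = [0.20, 0.10, 0.05, 0.025, 0.01, 0.005]

def mixture_read_counts(total_reads: int, contaminant_fraction: float):
    """Return (reads from the main sample, reads from the contaminant)."""
    contaminant_reads = round(total_reads * contaminant_fraction)
    return total_reads - contaminant_reads, contaminant_reads

# One (NA12878 reads, NA10855 reads) pair per row of Table 2.
mixing_plan = [mixture_read_counts(TOTAL_READS, f) for f in CONTAMINANT_FRACTIONS]
```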

Real-data test results

After quantitative simulation testing, we applied the trained model in a set of real data comprising 22 samples. Table 3 displays the range of cell types and samples used, and the results. The samples are ranked by regression values from e1071::svm(). While predictions for 20 of the 22 samples were correct according to prior identification, two human-T-lymphoblast samples (see bold text in Table 3) were predicted as pure but were contaminated. In response, we checked the B-allele frequency distribution for these two samples (Fig. 4). The middle area of the CNV pattern was shifted lower from 0.5 to 0.3, indicating the samples were tumor-normal cells from the same individual. The distance of the shift in the CNV pattern reflects the percentage of tumor and normal cells in a sample.

We tested the model with a second data set comprising 53 samples. Twelve samples were purposely mixed with a contaminant, and 41 samples were pure. The test results showed sensitivity > 99.99% and specificity of 90.24%. Four false-positive samples were detected by our method. These false positives were all in formalin-fixed paraffin-embedded (FFPE) tissue samples that were likely degraded (Fig. 5). The false positives may have arisen because the features generated from a degraded sample are similar to those from a contaminated sample.
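The reported specificity can be checked from the counts given above. In the Python sketch below, the four false positives among 41 pure samples come from the text, while the assumption that all 12 contaminated samples were detected (no false negatives) is an inference from the reported sensitivity and is not stated explicitly.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# 12 contaminated samples (assumed all detected), 41 pure samples with 4 false positives.
sensitivity, specificity = sensitivity_specificity(tp=12, fn=0, tn=37, fp=4)
specificity_percent = round(100 * specificity, 2)
```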

Table 3

Contamination detection for a real-data series. Predictions for 20 of the 22 samples were correct according to prior identification. Two human T-lymphoblast samples (bold text) were predicted as pure but were contaminated.
Sample Name	Classification	Regression	Prior Identification
Human B-Lymphocyte L8	1	1.9243094	1
Human B-Lymphocyte 2 L20	1	1.9209875	1
Human Breast 2 L16	0	1.483925	0
Human Breast L4	0	1.463376	0
Human T-Lymphoblast 2 L21∗	0	1.3622305	1
Human T-Lymphoblast L9	0	1.3472358	1
Human Brain L3	0	1.3147938	0
Human Brain 2 L15	0	1.303287	0
Human Testis L12	0	1.245767	0
Human Cervix 2 L17	0	1.2429423	0
Human Testis 2 L24	0	1.2424441	0
Human Cervix L5	0	1.203943	0
Human Macrophage L10	0	1.158416	0
Human Macrophage 2 L22	0	1.1582528	0
Human Liver 2 L18	0	1.1442246	0
Human Liposarcoma L7	0	1.1406007	0
Human Liposarcoma 2 L19	0	1.132044	0
Human Skin 2 L23	0	1.1209464	0
Human Skin L11	0	1.1194772	0
Human Liver L6	0	1.1170909	0
Human Reference DNA Male L1	0	1.0945151	0
Human Reference DNA Male 2 L13	0	1.0906951	0

∗ This is a mixture of tumor and normal cells. See Fig. 2 for the B-allele frequency distribution of this sample.

In this study, we introduce a novel strategy to detect same-species contamination using B-allele frequency from only variant call information. We produced an R package, vanquish: Variant Quality Investigation Helper, for the analysis. Results on simulated data with a range of contamination levels indicate that our method is sensitive to even low levels of contamination, with an extremely low false-positive rate.

We followed up with additional analyses using real data on a range of tissue types, with different sample preparations. The results again indicate our method has excellent performance, with outstanding sensitivity and few false positives. Upon further inspection, the few false positives were from FFPE samples and likely occurred due to degradation of the samples.

We produced a user-friendly R package to enable rapid analysis of same-species contamination. Uniquely, our tool performs this important quality control step from VCF files, resulting in improvements to performance and memory requirements. Fig. 6 summarizes the run time of CNA region removal and feature generation (Hardware: Dell R820, 512GB of RAM). We ran five samples without a known change point 10 times each, with a uniform maximum number of runs of the algorithm, to determine the average run time. Following our expectation, larger samples require more run time for change-point detection and feature generation. Samples with more change points also require a longer run time.

As demonstrated in our data analysis, for samples with both tumor and normal cells, a shift in the CNV distribution reflects the proportion of the cell types. Estimating the percentage of tumor cells within a sample is an active area of bioinformatics research [31]. In ongoing work, we are working to extend the method to produce quantitative estimates.

While cross-species contamination in NGS is well-studied, few approaches have been proposed for same-species contamination. In the current study, we demonstrate a machine learning approach that uses reference samples to build an SVM that classifies samples as either pure or contaminated. The growing number of reference genomes available through initiatives such as the 1000 Genomes Project allows end-users to readily access and download reference samples.

We demonstrate the utility of our approach with both samples mixed in silico and samples mixed at the bench. Our method has excellent sensitivity, with controlled false positives across a range of contamination levels and tissue and cell types. One of the major advantages of our approach is that the method works after variant calling, allowing the user to interact efficiently with just the VCF file.

Simulation and real application studies

Change-point analysis for approximate copy number region detection

If the copy number information of a sample is not provided, change-point analysis can be conducted to find its copy number regions. The rmChangePoint() function within our vanquish package imports cpt.var() from the changepoint package [26]. The pruned exact linear time (PELT) method [26] and the Changepoints for a Range of PenaltieS (CROPS) algorithm [32] are employed to search for variance change points. Fig. 7A plots the B-allele frequencies between 0.05 and 0.95 of corresponding loci in the input VCF files. In the CNV patterns, the red vertical lines indicate where variance changes were detected. The plot is separated into sections by these change points. If more than 10% of loci have a B-allele frequency between 0.45 and 0.55 and the skewness is higher than 0.5, the section is included in further analysis. Fig. 7B shows the result after filtering. See the documentation of the vanquish package for more details.

Beta-binomial parameter estimation for reference samples

To calculate likelihood-based features for further analysis, maximum likelihood estimators of ρ are computed for the beta-binomial distributions of the heterozygous and homozygous models. For the B-allele frequency, the theoretical value of parameter \(p\) is 0.5 in the heterozygous model and 1 in the homozygous model. \(p\) is fixed at 0.5 and 0.999, respectively, to search for ρ in the corresponding model. L-BFGS-B [33] is applied to search for the maximum. For instance, NA10855 was chosen as a reference sample, and five replicates were sequenced by Q2 Solutions. The maximum likelihood estimator of ρ in each sample was estimated by the ρ estimating function in the vanquish package (Table 4). The sample averages were used for further analysis. The value of the estimator highly depends on the variant caller, so using the same variant caller for the reference sample, training sample, and test samples is recommended.
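The estimation of ρ with \(p\) held fixed can be sketched in a few lines of Python. This example deliberately swaps the optimizer: it uses a plain grid search over ρ rather than L-BFGS-B, so it illustrates the objective being maximized rather than the vanquish routine. The mapping from \((p, \rho)\) to the beta shape parameters, the grid resolution, and the read counts are all assumptions made for the example.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_loglik(counts, p: float, rho: float) -> float:
    """Sum of beta-binomial log PMFs over (alt_count, depth) pairs."""
    # Assumed mapping from (p, rho) to the beta shape parameters.
    alpha = p * (1 - rho) / rho
    beta = (1 - p) * (1 - rho) / rho
    total = 0.0
    for x, n in counts:
        total += (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                  + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))
    return total

def estimate_rho(counts, p: float, grid):
    """Grid-search maximum likelihood estimate of rho with p fixed."""
    return max(grid, key=lambda r: betabinom_loglik(counts, p, r))

# Made-up (alternative-allele reads, depth) pairs at heterozygous loci.
het_counts = [(48, 100), (55, 100), (40, 100), (62, 100),
              (35, 100), (51, 100), (45, 100), (58, 100)]
rho_grid = [i / 1000 for i in range(1, 1000)]
rho_hat = estimate_rho(het_counts, p=0.5, grid=rho_grid)
```

A gradient-based bounded optimizer such as L-BFGS-B reaches the same maximum with far fewer likelihood evaluations, which matters when a VCF file contains many thousands of loci.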

Table 4

Maximum likelihood estimator \(\hat{\rho}\) of NA10855 samples. \(\hat{\rho}\) of heterozygous and homozygous models was estimated for each sequencing replicate. The sample mean can be used for generating features from the training data set.
	Heterozygous \(\hat{\rho}\)	Homozygous \(\hat{\rho}\)
NA10855-1	0.154	0.0269
NA10855-2	0.223	0.0253
NA10855-3	0.177	0.0210
NA10855-4	0.187	0.0310
NA10855-5	0.169	0.0274
Sample mean	0.182	0.0263

Features in the classification and regression models

To train the classification and regression model, 238 samples were sequenced by Q2 Solutions as a training data set. Of the 238 samples, 124 were pure and 114 were contaminated. Using the TruSeq DNA PCR-Free sample preparation kit (Illumina Inc., San Diego, CA, USA), sequencing libraries were generated following the recommendations of the manufacturer and index codes were added. The library quality was evaluated with the Qubit® 2.0 fluorometer (Thermo Scientific, CA, USA) and Agilent Bioanalyzer 2100 device. Finally, the Illumina NovaSeq 6000 platform was used to sequence the library.

Some samples were purposely contaminated in a wet lab, and others were simulated in silico by two pure FASTQ format files [34]. The B-allele frequency patterns differ between pure and contaminated samples. Only the heterozygous loci detected in samples are plotted in Fig. 1. Pure samples (see Fig. 1A) have a narrow horizontal band, and contaminated samples (Fig. 1B) have a relatively uniform distribution for B-allele frequency. Eight boxplots, along with t-tests (null hypothesis of no differences), show the difference between pure and contaminated samples for each feature (Fig. 8). Among the eight features, HomVar, HetVar, and HighRate had significant P-values of 1.786 × 10⁻⁹, 1.750 × 10⁻⁶, and 4.540 × 10⁻²⁰, respectively.

Tuning cost and gamma parameters in the radial kernel SVM

We used the Monte Carlo method (1000 repetitions) and tune() from R package e1071 to tune the cost and gamma parameters in the SVM. The 238 samples were split into training (70%, 167 samples) and test (30%, 71 samples) sets. For the training set, we used grid search to tune the cost parameter in the range of (2⁻⁴, 2¹²) and the gamma parameter in the range of (2⁻⁴, 2⁴). We then calculated sensitivity and specificity from the test set using tuned cost and gamma. Table 5 shows the results of Monte Carlo simulation, including median values of cost (16) and gamma (0.25) and mean values of sensitivity (97.65%) and specificity (96.27%). We used the tuned cost and gamma parameters in a radial kernel SVM model for contamination prediction.
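The repeated random splitting behind this Monte Carlo procedure can be sketched as follows. The Python code only generates the 70/30 index splits; the SVM tuning and scoring on each split (done with e1071 in this study) is not reproduced, and the function name and the random seed are arbitrary choices for the example.

```python
import random

def split_indices(n_samples: int, train_fraction: float, rng: random.Random):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

rng = random.Random(2021)  # arbitrary seed so the example is repeatable
# One (train, test) pair per Monte Carlo repetition.
splits = [split_indices(238, 0.7, rng) for _ in range(1000)]
train_idx, test_idx = splits[0]
```

Each repetition would tune cost and gamma on its training indices and record sensitivity and specificity on its test indices; the medians and means over the 1000 repetitions are what Table 5 summarizes.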

Table 5

Monte Carlo test results for parameter tuning and performance testing.
Median cost	Median gamma	Average sensitivity	Average specificity
16	0.25	97.65%	96.27%

Features in the support vector machine (SVM)

All SNPs distributed across chromosomes are classified as either homozygous (1/1) or heterozygous (0/1). The LOH value is the ratio of heterozygous SNP loci to homozygous SNP loci. A large LOH value means a sample has more heterozygous SNP loci.

Each SNP locus has a respective B-allele frequency (BAF), which is the fraction of the total read depth at that locus that supports the alternative allele. We applied three cut-off values to separate the support set of BAF, [0, 1], into four sub-regions: HomRate [0.99, 1], HighRate [0.7, 0.99), HetRate [0.3, 0.7), and LowRate [0, 0.3). A pure sample is expected to have higher HomRate and HetRate values than a contaminated sample.

HomRate is the number of loci with BAF [0.99, 1] over the total number of SNP loci in a sample.
HighRate is the number of loci with BAF [0.7, 0.99) over the total number of SNP loci in a sample.
HetRate is the number of loci with BAF [0.3, 0.7) over the total number of SNP loci in a sample.
LowRate is the number of loci with BAF [0, 0.3) over the total number of SNP loci in a sample.

For SNP loci distributed within the HomRate region as defined above, the variance of BAF values is defined as HomVar. HetVar is calculated using a similar procedure. A pure sample is expected to have lower HomVar and HetVar values than a contaminated sample.
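The distribution-based features defined above can be computed from a list of genotype labels and BAF values, as in the following Python sketch. It is an illustration rather than the vanquish implementation: the function name and input layout are hypothetical, the use of the sample variance (rather than the population variance) is an assumption, and the eight example loci are made up.

```python
import statistics

def distribution_features(loci):
    """Compute LOH, the four region rates, HomVar, and HetVar.

    loci: list of (genotype_label, baf) with label "Het" or "Hom".
    """
    n = len(loci)
    bafs = [baf for _, baf in loci]
    n_het = sum(1 for label, _ in loci if label == "Het")
    n_hom = sum(1 for label, _ in loci if label == "Hom")

    hom_region = [b for b in bafs if b >= 0.99]
    high_region = [b for b in bafs if 0.7 <= b < 0.99]
    het_region = [b for b in bafs if 0.3 <= b < 0.7]
    low_region = [b for b in bafs if b < 0.3]

    def region_variance(values):
        # Sample variance (assumed); undefined for fewer than two loci.
        return statistics.variance(values) if len(values) > 1 else 0.0

    return {
        "LOH": n_het / n_hom if n_hom else float("inf"),
        "HomRate": len(hom_region) / n,
        "HighRate": len(high_region) / n,
        "HetRate": len(het_region) / n,
        "LowRate": len(low_region) / n,
        "HomVar": region_variance(hom_region),
        "HetVar": region_variance(het_region),
    }

example_loci = [("Het", 0.48), ("Het", 0.52), ("Het", 0.25), ("Het", 0.75),
                ("Hom", 1.0), ("Hom", 0.995), ("Hom", 1.0), ("Het", 0.50)]
features = distribution_features(example_loci)
```

Because the four regions partition [0, 1], the four rates always sum to one, so they carry three independent pieces of information about the shape of the BAF distribution.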

The BAF of an SNP locus is assumed to follow the beta-binomial distribution. A reference sample (here, NA10855 sequenced at Q2 Solutions) assumed to be pure is used to calculate the maximum likelihood estimators for parameters \(p\) and \(\rho\) in the beta-binomial distribution. Subsequently, the log-likelihood values of all SNP loci in the sample are summed. For comparability purposes, the log-likelihood sum is then divided by the number of loci in each sample, so that the final outcome is the average log-likelihood across all loci in a sample. A pure sample is expected to have a higher average log-likelihood value than a contaminated sample.
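A minimal Python sketch of the average log-likelihood feature is given below. It assumes the same \((p, \rho)\) to \((\alpha, \beta)\) mapping as a common beta-binomial reparameterization, fixes \(p\) at 0.5 and 0.999 for the heterozygous and homozygous models, and plugs in the sample-mean \(\hat{\rho}\) values from Table 4; the function names and the two small sets of read counts are made up for illustration.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_logpmf(x: int, n: int, p: float, rho: float) -> float:
    """Beta-binomial log PMF with mean p and dispersion rho (assumed mapping)."""
    alpha = p * (1 - rho) / rho
    beta = (1 - p) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)

def average_loglik(loci, params) -> float:
    """Mean log-likelihood over loci given as (label, alt_count, depth)."""
    total = 0.0
    for label, x, n in loci:
        p, rho = params[label]
        total += betabinom_logpmf(x, n, p, rho)
    return total / len(loci)

# Fixed p per model and the sample-mean rho estimates from Table 4.
PARAMS = {"Het": (0.5, 0.182), "Hom": (0.999, 0.0263)}

# Made-up loci: one set close to the expected BAFs, one set shifted away from them.
pure_like = [("Het", 50, 100), ("Het", 46, 100), ("Hom", 100, 100), ("Hom", 99, 100)]
mixed_like = [("Het", 30, 100), ("Het", 72, 100), ("Hom", 90, 100), ("Hom", 85, 100)]

avg_pure = average_loglik(pure_like, PARAMS)
avg_mixed = average_loglik(mixed_like, PARAMS)
```

Dividing by the number of loci keeps the feature comparable between samples with different numbers of called variants, as described above.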

Tunable hyper-parameters

Two hyper-parameters, the soft margin constant \(C\) and the inverse-width parameter \(\gamma\) of the Gaussian (RBF) kernel, are optimized using grid search and cross-validation. Grid search is used to explore the two-dimensional space \((C, \gamma)\). The grid points of \(C\) are chosen on an exponential scale from 2⁻⁴ to 2¹², and the grid points of \(\gamma\) are chosen on an exponential scale from 2⁻⁴ to 2⁴. Sensitivity and specificity are estimated for each point on the grid.
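The grid and the two performance measures can be sketched as follows. The text gives only the endpoints of each range, so the spacing of one unit in the exponent is an assumption, as is the coding of contaminated samples as the positive class (label 1). Fitting the SVM in each cross-validation fold is left to whichever library is used and is not shown.

```python
from itertools import product


def exponential_grid(c_exp=(-4, 12), gamma_exp=(-4, 4), step=1):
    """All (C, gamma) pairs with C = 2**i and gamma = 2**j.

    Exponent ranges are inclusive; `step` = 1 is an assumed spacing.
    """
    c_values = [2.0 ** e for e in range(c_exp[0], c_exp[1] + 1, step)]
    gamma_values = [2.0 ** e for e in range(gamma_exp[0], gamma_exp[1] + 1, step)]
    return list(product(c_values, gamma_values))


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


grid = exponential_grid()
print(len(grid), grid[0], grid[-1])
```

Each grid point would then be scored by calling `sensitivity_specificity` on the held-out predictions from cross-validation.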

Abbreviations

BAM: binary alignment map

CNA: copy number aberration

CNV: copy number variation

GATK: Genome Analysis ToolKit

FFPE: formalin-fixed paraffin-embedded

indel: insertion/deletion

LOH: loss of heterozygosity

MAP: maximum a posteriori probability

NGS: next-generation sequencing

PELT: pruned exact linear time

RBF: radial basis function

SNP: single nucleotide polymorphism

SVM: support vector machine

SNV: single nucleotide variant

VCF: variant call format

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

The datasets used for the simulation study are available through the International Genome Sample Resource (IGSR) [https://www.internationalgenome.org/data/]. The data used to demonstrate the method on laboratory-created mixtures are available from Q2 Solutions but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. However, the data are available from the authors upon reasonable request and with permission of Q2 Solutions.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by funding from Q2 Solutions and intramural funds from the National Institute of Environmental Health Sciences.

Authors' contributions

TJ analyzed and interpreted the data for testing. MB generated the real data and helped in interpreting the results. A M-R was a major contributor in writing the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Dr. Chad Brown for discussion on same-species contamination and machine learning methods.

