High-throughput NGS workflow optimization
Our laboratory has developed an automated, high-throughput SARS-CoV-2 NGS workflow (Figure 1). This workflow utilizes a 2-step PCR NGS library preparation method: (1) gene-specific PCR to amplify the SARS-CoV-2 whole genome, using primers published by the ARTIC network, with modifications to add Illumina sequencing primer binding sites; and (2) index PCR to add specimen-specific barcoded sequencing adapters by fusion PCR using the Illumina sequencing primer binding sites. We first optimized primer pools to give even coverage across the SARS-CoV-2 genome. Gene specific primers were pooled into 4 pools (pool 1A, 1B, 2A, 2B, Supplemental Table S1), and positive patient specimens with RT-qPCR cycle threshold (Ct) value of 24 (n=11) were tested. To evaluate the coverage uniformity, we computed the average coverage of each amplicon at a normalized depth of 200,000 mapped reads. All 98 amplicons produced adequate coverage with a mean coverage of 973X (SD, 719; CV, 73.9%) (Figure 2A). However, some amplicons (n=7, 3 in the spike protein coding region) resulted in relatively lower coverage. When the same sample set was tested with a standard ARTIC v3 method, even coverage across the whole genome was achieved with a mean coverage of 1,390X (SD, 658; CV, 47.3%) at 200,000 mapped reads (Figure 2B).
Next, to improve coverage for these low-performing amplicons, we increased the corresponding primer concentration in the same primer pool or added the primer set to a different pool. When tested with RT-qPCR positive patient specimens (Ct 24, n=5), the optimized primer pools improved the coverage of these low-performing amplicons by 2- to 5-fold, resulting in improved amplicon balance with a mean coverage of 893X (SD, 514; CV, 57.6%) (Figure 2C). For the same sample set, standard ARTIC v3 method achieved 1,470X mean coverage depth (SD, 692; CV 47.1%) (Figure 2D).
As the SARS-CoV-2 genome has evolved, various mutations were found at the ARTIC v3 primer binding sites. Upon sequence analysis of 3,506 positive samples collected and processed in January and February of 2021, we found that 96% (3,367/3,506) of the samples had at least one primer binding site mutation with a median of 3.0 (range: 1-9) mutations with the potential to impair PCR efficiency per affected sample (Supplemental Table S3).
To minimize adverse effects on sequencing coverage, we employed a touchdown PCR method by gradually reducing the annealing temperature from 65°C to 55°C (0.7°C/second). We compared 79 samples with two different annealing temperature settings. With this approach, the percent amplicon dropout due to primer binding site mutations was decreased from 0.50–0.01% (Table 1). The percent amplicon dropout was calculated from the number of amplicons that did not generate coverage divided by the total amplicon number. Each dropout amplicon was manually reviewed for the presence of a mutation at the affected primer binding site.
Table 1
Comparison of percent amplicon dropout by annealing temperature settings
Clade1
|
Samples
Tested (N)
|
% Amplicon dropout:
|
% Amplicon dropout:
|
Annealing at 65°C
|
Annealing at 65°C-55°C2
|
20A
|
12
|
0.34
|
0
|
20B
|
10
|
0.51
|
0
|
20C
|
3
|
1.7
|
0.34
|
20G
|
24
|
0.21
|
0
|
20H (Beta, V2)
|
1
|
2.04
|
0
|
20I (Alpha, V1)
|
9
|
0
|
0
|
21C (Epsilon)
|
17
|
1.08
|
0
|
21F (Iota)
|
3
|
0
|
0
|
Total
|
79
|
0.5
|
0.01
|
1 SARS-CoV-2 clades were assigned with Nextclade (https://clades.nextstrain.org/). |
2 Touchdown PCR gradually reduced the annealing temperature from 65°C to 55°C (0.7°C/second). |
Assay Precision
We assessed intra-assay precision using 188 unique specimens with Ct values between 10 and 25 run in 3 replicates. Most (523/564; 92.8%) sequences met quality control metrics (Figure 3A); 175/188 (93.1%) unique samples met quality control requirements for 2 or more replicates and were utilized for lineage and clade comparison. All 175 samples had concordant clade and lineage assignments for all replicates (Table 2). Additionally, spike protein amino acid substitutions were analyzed for all 175 samples. On average, 7.3 ± 2.9 substitutions were detected per sample. There was 99.8% qualitative (1,272/1,275, 95% CI: 99.3-99.9%) agreement when the minority variant frequency was >20% (Supplemental Figure S1A). The average % CV for variant frequency between replicates was 0.5% (min CV: 0.0%; max CV: 6.7%).
Table 2
Intra- and inter-assay concordance of clade and lineage assignment using the automated, high-throughput SARS-CoV-2 NGS workflow
|
Number of samples
|
Clade/lineage1
|
Intra-assay precision
|
Inter-assay precision
|
Total
|
20A/B.1
|
5
|
1
|
6
|
20A/B.1.189
|
0
|
1
|
1
|
20A/B.1.232
|
0
|
1
|
1
|
20A/B.1.234
|
0
|
3
|
3
|
20A/B.1.243
|
0
|
3
|
3
|
20A/B.1.525
|
2
|
1
|
3
|
20A/B.1.539
|
0
|
1
|
1
|
20A/B.1.628
|
1
|
0
|
1
|
20B/B.1.1
|
0
|
1
|
1
|
20B/B.1.1.222
|
1
|
0
|
1
|
20B/B.1.1.231
|
1
|
0
|
1
|
20B/B.1.1.265
|
1
|
0
|
1
|
20B/B.1.1.316
|
1
|
0
|
1
|
20B/B.1.1.318
|
2
|
0
|
2
|
20B/B.1.1.345
|
0
|
1
|
1
|
20B/B.1.1.348
|
1
|
1
|
2
|
20B/B.1.1.434
|
2
|
0
|
2
|
20B/B.1.1.519
|
8
|
15
|
23
|
20B/P.2
|
0
|
1
|
1
|
20B/R.1
|
0
|
4
|
4
|
20C/B.1
|
3
|
0
|
3
|
20C/B.1.1
|
3
|
0
|
3
|
20C/B.1.324
|
1
|
0
|
1
|
20C/B.1.427
|
2
|
5
|
7
|
20C/B.1.429
|
2
|
16
|
18
|
20C/B.1.517
|
6
|
1
|
7
|
20C/B.1.526
|
8
|
4
|
12
|
20C/B.1.526.1
|
3
|
4
|
7
|
20C/B.1.526.2
|
9
|
5
|
14
|
20C/B.1.575
|
2
|
2
|
4
|
20C/B.1.637
|
1
|
0
|
1
|
20G/B.1.2
|
12
|
38
|
50
|
20G/B.1.596
|
4
|
6
|
10
|
20I (Alpha, V1)/B.1.1.7
|
77
|
44
|
121
|
20I (Alpha, V1)/Q.4
|
2
|
0
|
2
|
20I (Alpha, V1)/Q.8
|
1
|
0
|
1
|
20J (Gamma, V3)/P.1
|
5
|
1
|
6
|
21D (Eta)/B.1.525
|
2
|
0
|
2
|
21F (Iota)/B.1.526
|
5
|
0
|
5
|
21H/B.1.621
|
2
|
0
|
2
|
Total
|
175
|
160
|
335
|
1 SARS-CoV-2 clades were assigned with Nextclade (https://clades.nextstrain.org/) and lineages were assigned with Pangolin (https://pangolin.cog-uk.io/). |
We then assessed inter-assay precision for 168 unique specimens with Ct values between 10 and 25 over 3 independent runs. Overall, 479/504 (95.0%) sequences met quality control requirements (Figure 3B). Over 95% (160/168) of samples had valid sequence data for 2 or more replicates and were utilized for clade and lineage comparison. All 160 samples had concordant clade and lineage assignments (Table 2). On average, there were 5.9 ± 3.2 spike protein amino acid substitutions per sample with 99.5% (946/951, 95% CI: 98.8-99.8%) qualitative agreement between runs when the minority variant frequency was present in >20% of the reads (Supplemental Figure S1B). The average % CV for the variant frequency between replicates was 0.9% (min CV: 0.0%; max CV: 20.3%).
Assay Sensitivity
To demonstrate that our high-throughput workflow offers adequate sensitivity to yield complete SARS-CoV-2 genomes, we determined the limit of detection. The limit of detection was defined as the highest Ct value that yielded valid sequence data with ≥97% SARS-CoV-2 genome coverage for at least 95% of the specimens tested. A total of 39 unique samples with Ct values between 17 and 32 were serially diluted, yielding 186 samples with Ct values between 17 and 35. The percent of samples that yielded ≥97% consensus sequence ranged between 91% and 100% up to a Ct value of 27 (Figure 4A). Only 50% (10/20) and 25% (4/16) of samples with Ct values of 28 and 29, respectively, passed. Of note, 100% (20/20) of samples with Ct values of 28 and 88% (14/16) of those with Ct values of 29 generated >90% consensus sequence. Thus, 98.7% (147/149) of samples with Ct values less than 30 yielded >90% consensus sequence (Figure 4B). When over 90% consensus sequence was generated, there was 100% concordance for clade and lineage assignments.
Accuracy study
Next, we assessed the accuracy of the consensus sequences generated by this workflow. Three commercial synthetic RNA positive controls (clade Alpha, Beta, and Gamma variants; Twist Bioscience) were tested 12 times each, using 4 replicates per set-up in 3 independent set-ups, by 3 different scientists. Of note, the synthetic RNA controls do not cover 100% of the SARS-CoV-2 genome, and amplicon dropout was expected. In 36 trials, the mean percent consensus sequence was 95.1% (min 93%; max 98%). All controls resulted in 100% positive percent agreement for clade and lineage assignments (95% CI: 90.4%, 100.0%) and 100% for spike protein mutations that were detected when the minor allele frequency was over 20% (95% CI: 98.8%, 100.0%).
In addition, a total of 84 qRT-PCR negative samples were evaluated, in 3 independent set-ups using 28 negative samples per run, by 3 different scientists. All negative samples gave negative results, with mean amplicon coverage of 0.2 (min 0.0%, max 0.6%) relative to the PCR plate average, yielding 100% sample-level negative percent agreement (95% CI: 95.6%, 100.0%).
Robustness study
To assess assay robustness, a total of 2,688 samples (1,662 unique) were sequenced in a single Illumina NovaSeq sequencing run. Over 94.3% (2,416/2,562) of the total number of sequenced samples and 95.3% (1,584/1,662) of unique samples passed quality control (median coverage 2,478X); all negative samples (n=49) gave a negative result (median coverage 2.4X, <0.1% relative coverage); and all positive controls (n=29) gave the expected lineage and clade assignments. To monitor possible sample contamination, a total of 56 no-template controls (NTCs, 2 per 96 samples) were included. Although 54 of the 56 NTC gave a negligible number of reads (0.025% relative coverage to the plate mean), 2 NTCs showed 5.7% and 8.2% relative coverage indicating possible sample crossover. To assess the effects of the NTC contamination on assay accuracy, we compared the sequences to those obtained from the same samples (n=1,446) previously sequenced using the standard ARTIC v3 workflow on Illumina MiSeq sequencers. Concordance was 99.9% (1,445/1,446) for clade, and 99.8% (1,444/1,446) for lineage assignment (Supplemental Table S4, S5). Moreover, there was 100% clade and lineage concordance for all positive samples processed in the same 384-well plates with the 2 higher-coverage NTC wells. These results indicate that a low degree of NTC contamination is unlikely to interfere with accurate consensus sequence generation or with clade and lineage assignments.
Next, we analyzed the proportion of samples that generated 100% consensus coverage out of the samples that passed coverage QC (Table 3). With the optimized high-throughput workflow, over 95.9% of samples generated a complete consensus (range: 83.0–100% per clade). When the same set of samples were analyzed with a standard ARTIC v3 workflow, the proportion of samples with complete consensus sequence was much lower, 66.1% (range 0–100.0% per clade). In some variants (clades Beta, Delta, Epsilon, and Lambda) when sequenced with the ARTIC v3 workflow, amplicon dropouts were observed in almost all cases; dropout was resolved by employing our high-throughput workflow utilizing touchdown PCR.
Table 3
Proportion of SARS-CoV-2 samples generating complete consensus sequence with the automated, high-throughput SARS-CoV-2 NGS workflow, by clade
|
Samples generating complete consensus sequence
|
with each workflow, % (Complete/Incomplete)1
|
Clade2
|
High-throughput workflow3
|
ARTIC v3 workflow4
|
19B
|
100 (3/3)
|
0 (0/3)
|
20A
|
83.0 (44/53)
|
62.2 (33/53)
|
20B
|
96.2 (102/106)
|
59.4 (63/106)
|
20C
|
88.7 (87/98)
|
47.9 (47/98)
|
20G
|
96 (168/175)
|
81.1 (142/175)
|
20H (Beta, V2)
|
100 (7/7)
|
0 (0/7)
|
20I (Alpha, V1)
|
96.9 (691/713)
|
78.2 (558/713)
|
20J (Gamma, V3)
|
100 (44/44)
|
45.4 (20/44)
|
21A (Delta)
|
100 (1/1)
|
0 (0/1)
|
21C (Epsilon)
|
98.0 (99/101)
|
0.99 (1/101)
|
21D (Eta)
|
100 (10/10)
|
100 (10/10)
|
21F (Iota)
|
97.7 (128/131)
|
61.8 (81/131)
|
21G (Lambda)
|
100 (1/1)
|
0 (0/1)
|
21H
|
100 (3/3)
|
66.6 (2/3)
|
Total
|
96.0 (1,388/1,446)
|
66.1 (957/1,446)
|
1 Complete consensus sequence as defined by obtaining ≥97% SARS-CoV-2 genome coverage. |
2 SARS-CoV-2 clades were assigned with Nextclade (https://clades.nextstrain.org/). |
3 Samples were sequenced on the Illumina NovaSeq 6000 with the SP reagent kit using 2 x 251 cycles. |
4 Samples were sequenced on the Illumina MiSeq with the 600 cycle v3 kit using 2 x 251 cycles. |
Large-scale surveillance study
From January to September 2021, over 65,000 SARS-CoV-2 qRT-PCR-positive or transcription-mediated amplification (TMA)-positive specimens were successfully sequenced in our laboratory. Between January and the end of May 2021, the sequencing libraries were constructed using an in-house validated modified ARTIC v3 protocol19; subsequently, libraries were constructed using the high-throughput workflow designed to improve scalability, reported in this study. Cumulatively, the Delta variant of concern (VOC) was detected at the highest frequency (35%), because of the steep increase in prevalence of this variant after June 2021 (https://covid.cdc.gov/covid-data-tracker/#variant-proportions). Viruses in clade Alpha were the second-most prevalent, accounting for 29% of specimens tested, reflecting their high prevalence before June 2021. Other variants being monitored (VBMs) were detected at much lower frequencies, including Epsilon (5%), Iota (5%), and Gamma (4%) (Table 4). Beta, Lambda, Eta, and Kappa variants were found in less than 1% of the specimens sequenced.
Table 4
SARS-CoV-2 clade distribution of clinical cases analyzed between January and September 2021.
Clade1
|
Number cases
|
Percent
|
21A (Delta)
|
23,239
|
35.13
|
20I (Alpha V1)
|
19,022
|
28.76
|
20G
|
5,665
|
8.56
|
21C (Epsilon)
|
3,475
|
5.25
|
21F (Iota)
|
3,259
|
4.93
|
20C
|
3,103
|
4.69
|
20B
|
2,558
|
3.87
|
20J (Gamma V3)
|
2,540
|
3.84
|
20A
|
1,929
|
2.92
|
21H
|
683
|
1.03
|
20H (Beta V2)
|
156
|
0.24
|
19B
|
141
|
0.21
|
21G (Lambda)
|
140
|
0.21
|
21D (Eta)
|
131
|
0.2
|
20D
|
68
|
0.1
|
21B (Kappa)
|
25
|
0.04
|
20E (EU1)
|
12
|
0.02
|
Total
|
66,146
|
100
|
1 SARS-CoV-2 clades were assigned with Nextclade (https://clades.nextstrain.org/). |
We tracked the proportions of VBM and VOC variants on a weekly basis (Figure S2). The Epsilon variant was predominant (25.6%) in January and February 2021, then subsequently declined in frequency as the Alpha and Iota variants became more prevalent (65.3% and 13.3%, respectively) through May 2021. The Delta variant subsequently emerged in May 2021, rapidly increasing in prevalence to reach 99.8% of all specimens sequenced by mid-September 2021. These trends were consistent with national variant trends as reported by the CDC (https://covid.cdc.gov/covid-data-tracker/#variant-proportions).