The relative abundance of SARS-CoV-2 genome sequences in patient samples is generally low, so overlapping amplicon sequencing or deep shotgun sequencing is required to detect and reconstruct them accurately. Until now, the NRC for Respiratory Viruses in France (Institut Pasteur, Paris) has used two approaches to obtain SARS-CoV-2 genomes: shotgun metagenomics for samples with a CT of up to 25 and amplicon sequencing for samples with a CT above 25. We propose a third, complementary approach using hybridization-capture sequencing (Figure 1).
Study Design
To evaluate how well capture panels recover the SARS-CoV-2 genome from patient samples, we performed capture on 20 patient samples positive for SARS-CoV-2 by RT-qPCR.
| Sample ID | [RNA] ng/µl* | RQN+ | qRT PCR (CT) | Group |
|---|---|---|---|---|
| 4885 | 11 | 3.2 | 26.12 | A |
| 4716 | 3.0 | 4.9 | 25.77 | A |
| 4660 | 1.1 | 1.4 | 26.55 | A |
| 4520 | 4.0 | 4.5 | 25.51 | A |
| 4707 | 5.1 | 4.3 | 28.57 | B |
| 4697 | 45 | 1.6 | 28.72 | B |
| 4676 | 0.6 | 1.3 | 28.93 | B |
| 4653 | 0.8 | 1.0 | 29.27 | B |
| 4861 | 1.7 | 1.0 | 32.95 | C |
| 4787 | 2.5 | 2.5 | 32.59 | C |
| 4688 | 111 | 2 | 31.11 | C |
| 4673 | 4.4 | 1.6 | 32.6 | C |
| 4777 | 2.4 | 7.7 | 35.44 | D |
| 4668 | 0.8 | 2.1 | 35.88 | D |
| 4510 | undetectable | undetectable | 34.65 | D |
| 4489 | 4.2 | 1.7 | 35.15 | D |
| 4798 | 0.2 | 2.5 | 36.46 | E |
| 4797 | 0.6 | 3.6 | 39.19 | E |
| 4656 | 0.4 | 1.5 | 36.54 | E |
| 4544 | 7.8 | 1.4 | 36.74 | E |

*Nanodrop 1000 and Fragment Analyser
Table 1. Characteristics of patient nasal-swab samples. The 20 patient samples were grouped into five categories (A to E) in order of increasing average CT.
We used five different probe panels: two covering the entire SARS-CoV-2 viral genome and three composed of a mix of different viruses (pan-viral). All panels are ready-to-use commercial designs. For each sample, the capture reactions for all panels were performed using the same library preparation. All libraries were sequenced prior to capture on a NextSeq 500 in high-output mode, paired-end 150 bp. Samples were pooled in groups of four according to their CT (Table 1) to reduce bias related to target abundance within the samples. We used either a multi-virus panel or a SARS-CoV-2-specific panel for the captures. For the multi-virus panels, samples were sequenced on an Illumina NextSeq 500 in high-output mode, paired-end 150 bp. For the dedicated SARS-CoV-2 panels, all samples were sequenced using a MiSeq V2 paired-end 150 kit (Figure 2A and B).
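The pooling by CT described above can be illustrated with a short sketch. The text does not state the exact procedure used to form the groups, so the rule below (rank the samples by CT, then cut the ranking into consecutive pools of four) is an assumption; the function name `pool_by_ct` is hypothetical. Applied to the CT values of Table 1, this rule happens to reproduce groups A to E.

```python
# CT values copied from Table 1 (sample ID -> RT-qPCR CT).
ct_values = {
    4885: 26.12, 4716: 25.77, 4660: 26.55, 4520: 25.51,
    4707: 28.57, 4697: 28.72, 4676: 28.93, 4653: 29.27,
    4861: 32.95, 4787: 32.59, 4688: 31.11, 4673: 32.60,
    4777: 35.44, 4668: 35.88, 4510: 34.65, 4489: 35.15,
    4798: 36.46, 4797: 39.19, 4656: 36.54, 4544: 36.74,
}


def pool_by_ct(ct_by_sample, pool_size=4):
    """Rank samples by CT and cut the ranking into consecutive pools.

    Assumed rule, not the authors' documented procedure.
    """
    ranked = sorted(ct_by_sample, key=ct_by_sample.get)
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return {
        labels[i // pool_size]: ranked[i:i + pool_size]
        for i in range(0, len(ranked), pool_size)
    }


pools = pool_by_ct(ct_values)
for label, members in pools.items():
    mean_ct = sum(ct_values[s] for s in members) / len(members)
    print(label, members, f"mean CT = {mean_ct:.2f}")
```

Pooling samples of similar CT keeps high-titre samples from dominating the reads of a shared capture reaction, which is the bias the study design aims to limit.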
Metagenomics analysis of pre-captured samples
We started by sequencing all pre-capture libraries to evaluate the genomic information we could obtain from the samples using this strategy. This is the simplest way to sequence clinical samples, especially when screening for unknown pathogens or co-infections. Although widely used in clinical studies, this method yields high levels of host sequence contamination. Consequently, the genomes of interest may go undetected.
| Sample | Group | Total Number of Reads | % Reads Human | % Reads Other | % Reads SARS-CoV-2 |
|---|---|---|---|---|---|
| 4885 | A | 22.7M | 95.3 | 4.7 | 0 |
| 4716 | A | 19.5M | 95.5 | 4.5 | 0 |
| 4660 | A | 11.0M | 69.9 | 30 | 0 |
| 4520 | A | 16.0M | 91.4 | 8.6 | 0 |
| 4707 | B | 23.9M | 95.5 | 4.5 | 0 |
| 4697 | B | 27.6M | 62.4 | 38 | 0 |
| 4676 | B | 16.9M | 83.1 | 17 | 0 |
| 4653 | B | 13.9M | 66.8 | 33 | 0 |
| 4861 | C | 22.3M | 95.9 | 4.1 | 0 |
| 4787 | C | 26.9M | 92.9 | 7.1 | 0 |
| 4688 | C | 41.0M | 83.4 | 17 | 0 |
| 4673 | C | 22.2M | 39.1 | 61 | 0 |
| 4777 | D | 20.6M | 91.4 | 8.6 | 0 |
| 4668 | D | 25.3M | 87.0 | 13 | 0 |
| 4510 | D | 9.3M | 94.5 | 5.5 | 0 |
| 4489 | D | 19.9M | 94.4 | 5.6 | 0 |
| 4798 | E | 19.6M | 93.9 | 6.1 | 0 |
| 4797 | E | 16.1M | 95.1 | 4.9 | 0 |
| 4656 | E | 19.6M | 45.5 | 54 | 0 |
| 4544 | E | 25.1M | 92.7 | 7.3 | 0 |
Table 2. Summary of the metagenomic sequencing results for pre-capture libraries sequenced on the NextSeq 500 in high-output mode. Percentages are the proportion of reads assigned to each category by k-mer analysis with Kraken2. Other: virus + bacteria + unclassified.
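As an illustration of how the three percentage columns of Table 2 could be derived, the sketch below summarises a Kraken2-style report (six tab-separated columns: percentage, clade reads, directly assigned reads, rank code, taxonomy ID, name). It is not the authors' script. It assumes a database keyed on NCBI taxonomy IDs (9606 for Homo sapiens, 2697049 for SARS-CoV-2); the function name `summarise_report` and the toy counts are hypothetical and are not data from this study.

```python
HUMAN_TAXID = 9606           # Homo sapiens (NCBI Taxonomy)
SARS_COV_2_TAXID = 2697049   # SARS-CoV-2 (NCBI Taxonomy)


def summarise_report(report_text):
    """Return (% human, % other, % SARS-CoV-2) from a Kraken2-style report."""
    clade_reads = {}
    for line in report_text.strip().splitlines():
        fields = line.split("\t")
        # Columns: percent, clade reads, direct reads, rank, taxid, name
        clade_reads[int(fields[4])] = int(fields[1])
    # Taxid 0 holds the unclassified reads, taxid 1 (root) all classified reads.
    total = clade_reads.get(0, 0) + clade_reads.get(1, 0)
    human = 100 * clade_reads.get(HUMAN_TAXID, 0) / total
    sars = 100 * clade_reads.get(SARS_COV_2_TAXID, 0) / total
    # "Other" = everything else: viruses, bacteria and unclassified reads.
    return human, 100 - human - sars, sars


# Made-up counts, for illustration only.
toy_report = "\n".join([
    "5.00\t50\t50\tU\t0\tunclassified",
    "95.00\t950\t0\tR\t1\troot",
    "90.00\t900\t900\tS\t9606\t  Homo sapiens",
    "5.00\t50\t50\tD\t2\t  Bacteria",
])
human, other, sars = summarise_report(toy_report)
print(f"human={human:.1f}%  other={other:.1f}%  SARS-CoV-2={sars:.1f}%")
```

Computing "Other" as the remainder matches the definition given in the caption of Table 2 (virus + bacteria + unclassified).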
We obtained an average of 21M reads for each sample (Table 2). Taxonomic analysis detected mostly human sequences, followed by bacteria (Figure 3, left panel). The presence of viruses was marginal. No reads matched the SARS-CoV-2 genome according to our k-mer analysis (Figure 3, right panel), a result confirmed by standard mapping, suggesting that either the samples did not contain any SARS-CoV-2 sequences or that the sequencing depth was insufficient to detect the low-abundance SARS-CoV-2 in our samples.
Other studies [8, 9, 10] have obtained a full genome from metagenomic sequencing of samples with the same RT-qPCR CTs as those used in this study. However, the number of reads obtained was 70-fold higher than in our study [10]. Although the price per Gb sequenced has dropped significantly, we do not consider such depth cost-effective, and we did not scale up the sequencing of the metagenomic samples.
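The scale of this difference can be made concrete with a short calculation on the read counts of Table 2. Applying the 70-fold factor to the per-sample average is an assumption made here for illustration; the source states only that the read number in [10] was 70-fold higher.

```python
# Total reads per pre-capture library, in millions (Table 2).
reads_m = [22.7, 19.5, 11.0, 16.0, 23.9, 27.6, 16.9, 13.9, 22.3, 26.9,
           41.0, 22.2, 20.6, 25.3, 9.3, 19.9, 19.6, 16.1, 19.6, 25.1]

mean_reads_m = sum(reads_m) / len(reads_m)

FOLD = 70  # fold difference in read number cited from ref. [10]
# Assumption: the fold difference is applied to the per-sample average.
scaled_reads_m = mean_reads_m * FOLD

print(f"mean depth in this study: {mean_reads_m:.1f}M reads per sample")
print(f"70-fold deeper: about {scaled_reads_m / 1000:.2f} billion reads per sample")
```

Under this assumption, matching that depth would require roughly 1.5 billion reads per sample, which is the order of magnitude behind the cost argument above.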
Capture results: multi-virus panels
We started by capturing all samples with the three multi-virus panels from two different vendors: the Twist Pan-Viral Panel and the Illumina Respiratory Virus Panels. The Twist panel is composed of probes covering 1,160 targets, corresponding to more than 1,000 human pathogenic viruses. The Illumina panel is composed of 177 targets, corresponding to 83 respiratory viruses and 94 human genes. A list of the accession numbers and gene names is provided in the GitHub repository (https://github.com/biomics-pasteur-fr/manuscript_capture_seq/). As we were performing this benchmark, Illumina released version 2 of its respiratory virus panel, which showed better overall results than the previous version (Figure S1 A-B). The results described in this paper are those generated using version 2 of the Illumina panel.
Twist Pan-Viral Panel
The sequencing run of the samples captured using the Twist Pan-Viral Panel generated between 1M and 26M reads per sample (Table 3). As the main goal of this experiment was to detect SARS-CoV-2 sequences in patient samples, we mapped all data to the SARS-CoV-2 reference genome. With the Twist Pan-Viral Panel, SARS-CoV-2 was poorly captured from the samples, with between 0 and 3.2% of reads mapped to the reference and a breadth of coverage (BOC hereafter) below 40% (Figure 4A). Interestingly, sample 4489 from group D was enriched for SARS-CoV-2 sequences and had a BOC of 12.2%. This is quite surprising, as the samples with a CT > 32 were otherwise not enriched by this panel.
| | | Twist Pan-Viral Panel | | | | Illumina RESV 2 Panel | | | | Arbor SARS-CoV Panel | | | | Twist SARS-CoV Panel | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample | Group | Reads (M) | Map (%) | BOC | DOC | Reads (M) | Map (%) | BOC | DOC | Reads (M) | Map (%) | BOC | DOC | Reads (M) | Map (%) | BOC | DOC |
| 4885 | A | 8.7 | 0.2 | 25 | 66 | 3.1 | 16 | 100 | 2E+04 | 0.4 | 89 | 100 | 2E+03 | 0.5 | 81.7 | 100 | 2E+03 |
| 4716 | A | 7.4 | 0.2 | 29 | 76 | 3.1 | 25 | 100 | 3E+04 | 0.8 | 94 | 100 | 4E+03 | 0.7 | 83.8 | 100 | 2E+03 |
| 4660 | A | 3.9 | 3.3 | 33 | 506 | 4.9 | 75 | 100 | 2E+05 | 3.5 | 94 | 100 | 1E+04 | 2.6 | 95.0 | 100 | 9E+03 |
| 4520 | A | 7.7 | 1.4 | 40 | 441 | 9.0 | 43 | 100 | 2E+05 | 2.5 | 92 | 100 | 1E+04 | 2.2 | 89.6 | 100 | 8E+03 |
| 4707 | B | 9.5 | 0.0 | 22 | 17 | 4.9 | 4.1 | 100 | 9E+03 | 0.8 | 80 | 100 | 3E+03 | 0.5 | 48.4 | 100 | 932 |
| 4697 | B | 5.0 | 0.0 | 10 | 1.4 | 3.4 | 0.3 | 98 | 516 | 0.1 | 43 | 98 | 139 | 0.6 | 3.0 | 98 | 70 |
| 4676 | B | 5.8 | 0.4 | 25 | 99 | 3.2 | 29 | 100 | 4E+04 | 2.1 | 92 | 100 | 9E+03 | 1.4 | 84.8 | 100 | 4E+03 |
| 4653 | B | 4.0 | 1.8 | 31 | 281 | 5.5 | 59 | 100 | 1E+05 | 5.8 | 94 | 100 | 2E+04 | 5.5 | 91.6 | 100 | 2E+04 |
| 4861 | C | 4.3 | 0.1 | 23 | 14 | 1.8 | 7.5 | 100 | 6E+03 | 7.7 | 84 | 100 | 3E+04 | 3.6 | 56.1 | 100 | 8E+03 |
| 4787 | C | 4.9 | 0.0 | 4.5 | 0.2 | 2.1 | 0.1 | 53 | 51 | 0.5 | 6 | 55 | 138 | 1.0 | 1.4 | 56 | 55 |
| 4688 | C | 19 | 0.0 | 1.4 | 0.1 | 16 | 0.0 | 26 | 17 | 0.6 | 2 | 27 | 45 | 1.7 | 0.3 | 28 | 18 |
| 4673 | C | 2.6 | 0.0 | 6.9 | 0.9 | 1.6 | 0.4 | 93 | 249 | 0.5 | 54 | 95 | 1E+03 | 1.5 | 5.4 | 94 | 315 |
| 4777 | D | 7.8 | 0.0 | 0.0 | 0.0 | 8.5 | 0.0 | 0.0 | 0.0 | 2.2 | 0.0 | 0.0 | 0.0 | 2.1 | 0.0 | 0.3 | 0.0 |
| 4668 | D | 6.7 | 0.0 | 0.0 | 0.0 | 4.1 | 0.0 | 0.0 | 0.0 | 1.8 | 0.0 | 1.5 | 0.0 | 1.3 | 0.0 | 1.2 | 0.0 |
| 4510 | D | - | - | - | - | - | - | - | - | - | - | - | - | 2.4 | 46.9 | 95 | 4E+03 |
| 4489 | D | 7.6 | 0.0 | 12 | 1.2 | 5.3 | 0.3 | 94 | 776 | 3.7 | 30 | 95 | 5E+03 | 1.7 | 6.6 | 94 | 468 |
| 4798 | E | 2.2 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.2 | 0.0 | 1.1 | 0.0 | 1.6 | 0.0 |
| 4797 | E | 2.6 | 0.0 | 0.0 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 3.9 | 0.0 | 3.6 | 0.0 | 1.4 | 0.0 | 0.9 | 0.0 |
| 4656 | E | 1.5 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 | 0.0 | 0.0 |
| 4544 | E | 26 | 0.0 | 0.0 | 0.0 | 22 | 0.0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 2.7 | 0.0 | 2.7 | 0.0 |
Table 3. Mapping results for all samples captured using the different capture panels. All data were mapped to the Wuhan SARS-CoV-2 sequence (accession MN908947.3). BOC: breadth of coverage (%); DOC: depth of coverage.
These results suggest that although the viral sequence was captured, it was not captured entirely. It is possible that the probes captured SARS-CoV-2 reads less efficiently because of the size of the panel (i.e., the number of probes covering the entire genome). It is also possible that SARS-CoV-2 information was lost from our samples through degradation or during library preparation.
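For readers unfamiliar with the two coverage metrics reported in Table 3, the sketch below shows one common way to derive them from per-base depth, such as the three-column output (reference, position, depth) of tools like `samtools depth`. The text does not state how the values were computed, so the definitions used here are assumptions: BOC is the percentage of reference positions covered by at least one read, and DOC is the mean depth over the whole reference. The function names and the toy input are hypothetical.

```python
GENOME_LENGTH = 29_903  # length of MN908947.3, in bases


def parse_depth(text):
    """Parse tab-separated lines of (reference, position, depth)."""
    depths = {}
    for line in text.strip().splitlines():
        _ref, pos, depth = line.split("\t")
        depths[int(pos)] = int(depth)
    return depths


def coverage_stats(depth_by_position, genome_length=GENOME_LENGTH, min_depth=1):
    """Return (BOC in %, mean DOC).

    Positions missing from the mapping are treated as depth 0.
    min_depth=1 is an assumed threshold, not one given in the text.
    """
    covered = sum(1 for d in depth_by_position.values() if d >= min_depth)
    boc = 100 * covered / genome_length
    doc = sum(depth_by_position.values()) / genome_length
    return boc, doc


# Toy example: a 10-base reference with four covered positions.
toy_depth = "ref\t1\t5\nref\t2\t5\nref\t3\t0\nref\t7\t20\nref\t8\t10"
boc, doc = coverage_stats(parse_depth(toy_depth), genome_length=10)
print(f"BOC = {boc:.1f}%, DOC = {doc:.1f}x")
```

A low BOC combined with a non-zero DOC, as seen for the Twist Pan-Viral Panel, indicates that reads pile up on a fraction of the genome while the rest remains uncovered.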
We performed taxonomic analysis of all samples to understand why so small a percentage of reads corresponded to our target. The main objective was to classify the reads that did not map to the SARS-CoV-2 reference sequence. With the exception of samples 4544 and 4688, between 65% and 97% of reads had hits in the human database (Figure 5A). Interestingly, most of the reads of samples 4544 and 4688 corresponded to Primate Bocaparvovirus 1 and Human orthopneumovirus, respectively. Both viruses are present in the panel and, for sample 4688, we found a number of SARS-CoV-2 reads as well, covering 1% of the entire viral genome. Remarkably, the taxonomic analysis also classified a significant proportion of reads from certain samples as bacterial (Figure 5A). Although they represented only 0.2 to 23% of the reads, they corresponded to different phyla and dozens of different species.
Illumina Respiratory Virus Panel
The sequencing run of the captured samples using the Illumina Respiratory Virus Panel generated between 3M and 21M reads per sample. After mapping the reads to the SARS-CoV-2 reference sequence, the samples captured using the Illumina Resv2 Panel showed a higher percentage of reads on target than those captured using the Twist Pan-Viral Panel (Figure 4A). For samples with a CT < 30 (Groups A/B), the enrichment of SARS-CoV-2 reads was marked (average of 31.4%). For 6 of the 8 samples (in groups A and B), we obtained a BOC of 100%, meaning that this panel managed to capture the entire viral genome. Group C showed a lower percentage of enrichment and a lower BOC percentage (between 26 and 99%). Groups D and E were not enriched for SARS-CoV-2 sequences, with the exception of sample 4489 (as for the Twist Pan-Viral Panel), which had a BOC of 95% for the SARS-CoV-2 genome.
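The average on-target percentage quoted above for samples with a CT < 30 can be recomputed from the rounded values in Tables 1 and 3. The sketch below is only a consistency check on the published tables, not the authors' analysis code; group C is included for comparison.

```python
# (sample, CT from Table 1, Illumina panel Map (%) from Table 3), groups A to C.
rows = [
    (4885, 26.12, 16), (4716, 25.77, 25), (4660, 26.55, 75), (4520, 25.51, 43),
    (4707, 28.57, 4.1), (4697, 28.72, 0.3), (4676, 28.93, 29), (4653, 29.27, 59),
    (4861, 32.95, 7.5), (4787, 32.59, 0.1), (4688, 31.11, 0.0), (4673, 32.60, 0.4),
]


def mean_on_target(selected):
    """Mean Map (%) over a list of (sample, ct, map_pct) rows."""
    return sum(map_pct for _, _, map_pct in selected) / len(selected)


low_ct = [r for r in rows if r[1] < 30]     # groups A and B
high_ct = [r for r in rows if r[1] >= 30]   # group C

print(f"CT < 30: {mean_on_target(low_ct):.1f}% on target (n={len(low_ct)})")
print(f"CT >= 30 (group C): {mean_on_target(high_ct):.1f}% on target (n={len(high_ct)})")
```

The eight samples of groups A and B average 31.4% on-target reads, in agreement with the value given in the text, against 2.0% for group C.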
Taxonomic analysis showed between 5 and 69% of the reads mapping to the human database (control genes captured by the panel excluded) (Figure 5B). Again, most of the reads of samples 4544 and 4688 were identified as Primate Bocaparvovirus 1 and Human orthopneumovirus, respectively. Both viruses are also present in the Illumina panel, confirming the results of the Twist Pan-Viral capture. In addition to Human orthopneumovirus, we detected SARS-CoV-2 reads from sample 4688, covering 26% of the entire viral genome. Using this panel, we also found bacteria in a number of the samples, with the same phyla detected as in the samples captured with the Twist Pan-Viral Panel.
SARS-CoV-2 panels
For this benchmark, we tested two different SARS-CoV-2 panels, one from Twist and another from Arbor Biosciences. The main difference between the two panels is that the Arbor probes consist of ssDNA and the Twist probes of dsDNA. Another difference concerns how the capture is performed: Arbor recommends a double-capture protocol for low-abundance targets, whereas the Twist protocol is a classic single capture.
Twist SARS-CoV-2 Capture Panel
Sequencing runs of this capture panel generated between 0.5M and 5M reads per sample. Contrary to the multi-virus kits, between 3 and 94% of the reads mapped to the SARS-CoV-2 reference sequence for samples within groups A and B (Figure 4B). Consistent with the results obtained with the Illumina multi-viral panel, samples 4707 and 4697 showed a lower percentage of on-target reads than other samples with similar CT values. All samples of these groups had a BOC between 98 and 100%. Group C showed between 0.2 and 56% on-target reads and a BOC ranging from 28 to 100% (Table 3). With the exception of sample 4489 (as for the multi-virus panels), with 6% on-target reads and a BOC of 94%, samples of groups D and E had few on-target reads, corresponding to 0.3% and 2.7% of the SARS-CoV-2 genome, respectively. These reads were missed by both multi-virus panels tested in this study.
Taxonomic analysis of all samples showed between 1 and 96% of the reads classified as human. As expected, the proportion of human reads was higher for samples with high CT values (Figure 5C). Interestingly, samples 4544 and 4688, for which most of the reads were identified as Primate Bocaparvovirus 1 and Human orthopneumovirus, respectively, with the multi-virus panels, showed no detection of those two viruses after capture with the Twist SARS-CoV-2 Panel. Indeed, 95% of the data for sample 4544 and 75% for sample 4688 matched the human database. However, we detected SARS-CoV-2 reads from sample 4688, covering 27% of the entire viral genome, confirming the presence of the virus in this sample. These results suggest that the Twist SARS-CoV-2 probes capture the virus with high specificity. Even though these two samples contained an abundance of other viruses, these probes managed not only to avoid capturing them but also to specifically capture the SARS-CoV-2 reads present in sample 4688. We also observed the presence of bacteria in all samples, between 0.01% and 64%. As shown previously, samples with a high percentage of bacteria did not have a predominant phylum or species, but a mixture of dozens of different species.
Arbor SARS-CoV-2 Panel
Holmes et al. [12] demonstrated the advantages of double capture when the target genome within samples is scarce. We performed a double capture of our samples using this panel, as recommended by the manufacturer. Overall, this panel showed the best percentage of on-target reads of all panels tested, reaching up to 94% (Figure 4B). For almost all samples in which the virus was detected, the percentage of on-target reads was higher using the Arbor panel than the Twist panel. Moreover, the BOC results were very similar to those of the Twist SARS-CoV-2 Panel.
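The per-sample comparison between the two SARS-CoV-2 panels can be checked against the Map (%) columns of Table 3. In the sketch below, a sample counts as "detected" when both panels have data and at least one reports a non-zero on-target percentage; this criterion is an assumption made for illustration, and the values are the rounded figures from the table.

```python
# Map (%) from Table 3: sample -> (Arbor, Twist SARS-CoV-2). None = no data.
on_target = {
    4885: (89, 81.7), 4716: (94, 83.8), 4660: (94, 95.0), 4520: (92, 89.6),
    4707: (80, 48.4), 4697: (43, 3.0), 4676: (92, 84.8), 4653: (94, 91.6),
    4861: (84, 56.1), 4787: (6, 1.4), 4688: (2, 0.3), 4673: (54, 5.4),
    4777: (0.0, 0.0), 4668: (0.0, 0.0), 4510: (None, 46.9), 4489: (30, 6.6),
    4798: (0.0, 0.0), 4797: (0.0, 0.0), 4656: (0.0, 0.0), 4544: (0.0, 0.0),
}

# Assumed criterion: both panels have data and at least one is non-zero.
detected = {
    sample: values for sample, values in on_target.items()
    if None not in values and max(values) > 0
}
arbor_higher = [s for s, (arbor, twist) in detected.items() if arbor > twist]
exceptions = sorted(set(detected) - set(arbor_higher))

print(f"Arbor higher in {len(arbor_higher)} of {len(detected)} detected samples")
print("Exceptions:", exceptions)
```

With these rounded values, the Arbor panel gives the higher on-target percentage in 12 of the 13 samples that meet the criterion, the exception being sample 4660 (94% against 95.0%).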
Taxonomic analysis showed between 0.2% and 98% of reads matching the human database (Figure 5D). Although the percentage of reads classified as human was equivalent to that for the Twist SARS-CoV-2 capture panel, there was a significant drop in the percentage of reads classified as bacterial, with a maximum of 7% of reads matching the bacteria database. For samples 4544 and 4688, once again, we did not capture reads corresponding to the other viruses present in these samples. These results show that this panel captures SARS-CoV-2 specifically, even in the presence of other viral genomes. The percentage of reads assigned to the human genome was 96% for both samples, suggesting that, in the absence of the target, the probes preferentially bind to sequences of the human genome rather than those of other viral genomes present in the sample.