Study design
The serum samples tested by untargeted mNGS come from a larger cohort of 3192 consecutively paediatric patients (2 to 59 months of age), recruited between December 2014 and February 2016, at nine outpatient clinics in Dar es Salaam, Tanzania, with the inclusion criteria of “acute febrile illness” (axillary temperature ≥ 37.5 °C). The detailed characteristics of this cohort are described in Keitel et al. [20].
A sub-cohort of interest comprising 1233 patients was then selected (Fig. 1) based on three diagnoses which carried the highest theoretical probability of detecting blood viruses with significant clinical consequences and for which mNGS would most increase the diagnostic evidence for the aetiology of the illness. These diagnoses were obtained using by the clinical decision algorithms of the e-POCT study [20]: 1) “Fever without focus” representing children who would most benefit from further diagnostics; 2) “Severe illness”, representing children in whom the impact of blood viruses may be investigated. 3) “Malaria”, in whom blood viruses may interact with this important and frequent parasite and/or be less likely as the cause of the febrile episode.
A total of 823 patients (resulting in 823 sera samples) were then randomized from this sub-cohort of interest and submitted to the entire process of mNGS (detailed below). Only samples with a minimum of one million reads for both RNA and DNA libraries were considered for further bioinformatic analysis, resulting in the exclusion of 4 samples. Three samples were further excluded; two consecutive to a technical problem encountered during the library procedures, and one due to an overrepresentation of insect sequences after analysis of raw sequencing data [22]. Therefore, 816 of the 823 initial selected samples were considered for analysis.
Sample extraction and untargeted metagenomic next-generation sequencing
Two hundred twenty microliters (ul) of each serum sample was centrifuged 10,000 X g for 10 min and treated with 40 U of Turbo DNAse (2U/ul) (Ambion, Rotkreuz, Switzerland). Of the total resulting 244 µl, 240 µl were recovered to perform the nucleic acid extraction. RNA and DNA virus genome extractions were performed as previously described [23]. Briefly, RNA was extracted with TRIzol (Invitrogen, Carlsbad, CA, USA) and DNA with a NucliSens easyMAG magnetic bead system (bioMérieux, Geneva, Switzerland) followed by a double-stranded DNA synthesis with DNA polymerase I, Large Fragment (Klenow) (New England BioLabs, Ipswich, MA, USA). RNA and DNA pellets were then resuspended in respectively 10 ul and 5 ul of nuclease-free water (Promega, Dübendorf, Switzerland), and quantified using the Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and adjusted at 2 ng/ul.
For RNA virus detection, ribosomal RNA was removed using the Ribo-Zero Gold depletion kit (Illumina, San Diego, US) prior to library preparation. RNA libraries were prepared with the TruSeq total RNA preparation protocol (Illumina) before being multiplexed by four on the HiSeq 2500 platform (Illumina) using the 2 × 100 bp paired-end protocol. The mean of total number of read pairs obtained was 4.45E7 (range from 2E6 to 1.09E8).
For DNA virus detection, libraries were prepared with the Illumina Nextera XT protocol (12 PCR cycles) before being multiplexed by six on the HiSeq 4000 platform (Illumina) using the 2 × 100 bp paired-end protocol. The mean of total number of read pairs obtained was 5.46E7 (range from 2E6 to 2.52E8).
Both RNA and DNA library concentrations were measured using Qubit (Life Technologies). The size distribution of fragments was controlled using a 2200 TapeStation (Agilent, Santa Clara, CA, USA).
To assess the presence of potential mNGS contaminants (i.e. false positive results), a “no-template” control (NTC) submitted to the whole procedure was included to each sequencing run. In order to assess the entire process efficiency (i.e. from sample preparation to bioinformatics analysis) each sequencing run included a virus-spiked positive control (PC). This latter is also useful in addition to the NTC samples to evaluate the presence of potential sporadic mNGS contaminants. For the RNA virus procedure, the PC consists of a canine distemper virus (CDV)-spiked positive control (RNA-PC), whereas for the DNA virus procedure, a baculovirus (GenScript, Piscataway, NJ, USA) harboring 793 nucleotides of the CDV fusion gene (corresponding to positions 5505 to 6297 from the GenBank KY971529 reference) was used as spiked-positive controls (DNA-PC).
Bioinformatic analysis
For each sample, the reads generated by the Illumina platforms (RNA and DNA libraries) were processed for virus detection using a bioinformatics pipeline (FeVir) designed to detect all vertebrate viruses and a de novo assembly sequencing approach.
Bioinformatics pipeline for vertebrate virus detection
Paired reads were quality filtered and mapped against Virosaurus (version V90v_2018_11) [21], a curated database for all known vertebrate viruses, using virusscan 1.0 [24]. The mapping was performed allowing up to 12 mismatches per tag but at most 15 mismatches for the pair, ultimately retaining the lowest possible mismatch. Sequence pairs whose lowest mismatch count mapped equally well onto more than 512 distinct viruses entries were discarded. Non-redundant mapping results were used to compute metrics for mapped reads and coverage. Then for each virus species (or genus for anelloviridae) detected, results with the best total genome coverage value were kept. Only results with coverage ≥ 300 nucleotides were considered as positive.
Index hopping/lane cross-contamination was removed using a 1% threshold. An important challenge in the interpretation of mNGS data is to distinguish clinically significant viruses from those that are known to be medically irrelevant and also to differentiate novel viruses (not previously described in humans) from those that are more likely to be environmental or mNGS reagent contaminants [25]. Therefore, results for viruses detected in processes controls and endogenous retrovirus were not reported.
For segmented viruses, results for each segment were summed. For Herpesviridae, each gene’s results were summed using a non-redundant and non-overlapping list of genes.
Viral hits were classified into 3 groups:
1) Those of “recognized clinical significance” including viruses known to infect humans and well recognized in clinical practice as disease-causing agents. As for most pathogens the spectrum of associated illness associated with this group of viruses ranges from mild symptoms (or even asymptomatic infection) to life-threatening disease.
2) Commensal viruses of “undetermined clinical significance” that comprise viruses that infect human but for which the association with a specific disease has not yet been established. Most of these viruses however have shown to interact with the human immune response and have been referred to by some as commensal agents. The two main representatives are Anelloviridae and human pegivirus-1.
3) Those of “unknown significance” that represent viral agents not usually recognized as infecting humans and/or have never fulfilled Koch’s postulates that would attribute a theorical pathogenic role [26]. Thus, it could reflect transient or aborted infections, environmental spill over, contaminants from skin or other body fluids or spurious mNGS reagent contaminants. We consider this group of particular interest to surveillance efforts, since it could signal contact with potential emerging human pathogens and guide targeted screening in the future.
De novo analysis
The de novo analysis was performed as previously published [27]. First, reads were trimmed to remove low-quality and adapter sequences using Trimmomatic (v0.33). Human reads were removed by mapping reads against the human genome and transcriptome (hg38, gencode.V23) using the SNAP nucleotide aligner program [28]. Then, de novo assembly was performed using IDBA-UD (v.1.1.3) [29] and the generated contigs of greater than 2000 nucleotides were blasted (blastx, v.2.3.0+) [30] against the U-RVDBv12.2 viral database [31]. Anelloviridae contigs were filtered out and processed separately. The newly obtained sequences were then cross-checked by BLAST (blastn, blastx) against GenBank (Nucleotid collection (nr/nt), Non-redundant protein sequences (nr)) and NCBI conserved domain tool. Novel virus contigs with a presumed bacterial or fungi host were not considered for further analyses.
Sample and control FASTQ files were additionally filtered for low-complexity sequences, using tagdust (v2.31). Remaining reads were mapped against assembled genomes that were not already detected by the previously described FeVir pipeline, using SNAP. For each new hit, mapped reads and coverage metrics were then computed.
Only results with ≥ 300 nucleotides of coverage were considered. Results for viruses detected in processes controls were not reported.
Anelloviridae sequence analysis
Anelloviridae contigs with a complete ORF1 larger than 600 amino acids were separated into Alpha-, Beta- and Gamma-torquevirus genera using multiple alignment (muscle v3.8.31) with the reference sequences of ORF1. Alternative “ACG” initiation codon was also used to detect Alphatorquevirus ORF1.
To estimate the proportion of assembled contigs that was “new” compared to reported sequences in GenBank (Query: alphatorquevirus[Organism] OR betatorquevirus[Organism] OR gammatorquevirus[Organism] OR unclassified Anelloviridae[Organism], downloaded in March 2019), we first excluded non-primate sequences, then filtered as above and added to the new assembled sequences. For the three genera, complete ORF-1 sequences were separated in clusters of 80% nucleotide identity, using cdhit-est (CD-HIT v4.6). For each cluster, sequences of 90% identity were selected using cdhit-est. Then representative sequences were aligned two by two using muscle (v3.8.1551), when two clusters contain reference sequences that shared more than 80% nucleotide identity, clusters were merged.
To determine the number of distinct Anelloviridae of the same genus present in each sample, the filtered FASTQ files were mapped to the ORF1 representative sequence using SNAP, for each cluster, positive detection was considered if 50% or more of the ORF1 was covered.
Additionally, for each of the assembled Anelloviridae contigs that contained a complete ORF1, circular genomes were linearized starting after the GC rich region and incomplete genomes were trimmed the same way. Finally, representative genomes were selected at 90% nucleotide identity with cdhit-est. Then filtered FASTQ files were mapped against the representative genomes with SNAP. Metrics were computed and results were processed as described above (≥ 300 nucleotides, 1% index-hopping/lane cross-contamination threshold). Results with the best total genome coverage value for each genus were reported if previously found negative with the FeVir pipelin