Viral metagenomic detection
Workflow overview and rationale
One of the main developments since INSaFLU’s first release [19] focused on upgrading the platform for automated metagenomic virus identification, in order to support both human and veterinary clinical practice and disease outbreak investigations. After reviewing the current state-of-the-art in bioinformatics pipelines for metagenomic virus diagnostics [18,21–26] and consulting the TELEVIR consortium (Public Health and Veterinary institutes across Europe), a modular pipeline was designed and developed, incorporating the key steps of NGS metagenomics taxonomic classification and reporting (Figure 1 and Figure 2), namely: read quality control, viral enrichment / host depletion, de novo assembly, read/contig taxonomic classification, confirmatory reference-based remapping and reporting. The choice of the internal components of the implemented workflows (software, default parameters, etc.) resulted from extensive benchmarking (see next section). TELEVIR resources, benchmarking and implementation are detailed in Additional file 1. In summary, the input/output flow and main functionalities behind the main TELEVIR steps are as follows:
Read quality analysis and improvement: This step takes the input single- or paired-end reads (fastq.gz format; Illumina, Ion Torrent or ONT) and produces quality-processed reads, as well as quality control reports for each file, before and after this step. This step is performed automatically following sample upload and is thus shared by the two components (virus detection and genomic surveillance) of the INSaFLU-TELEVIR platform. Quality filtering and trimming of Illumina reads is performed as described in Borges et al. (2018) [19]; treatment of ONT data is described below. An optional, extra filtering layer that targets low-complexity reads is available as part of the TELEVIR pipeline using the software PRINSEQ [27]. Parameters are modifiable by the user.
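As an illustration of the kind of low-complexity screening this step performs, the minimal Python sketch below filters FASTQ reads by k-mer entropy (a DUST-like criterion; the threshold and k value are illustrative and not necessarily the PRINSEQ defaults):

```python
import gzip
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy over overlapping k-mers, scaled to 0-100."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(min(total, 4 ** k))
    return 100 * entropy / max_entropy if max_entropy > 0 else 0.0

def filter_low_complexity(fastq_in, fastq_out, min_entropy=70.0):
    """Keep only reads whose k-mer entropy is above the threshold."""
    with gzip.open(fastq_in, "rt") as fin, gzip.open(fastq_out, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if kmer_entropy(record[1].strip()) >= min_entropy:
                fout.writelines(record)
```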
Viral enrichment: This step retains potential viral reads based on a rapid and permissive classification of the reads against a viral sequence database. It is performed directly on raw reads (if QC was turned OFF) or on quality-processed reads (if QC was turned ON).
Host depletion: This step removes potential host reads by reference-based mapping against host genome sequence(s); mapped reads are treated as potential host reads and removed. It acts on virus-enriched reads, unless the viral enrichment step was turned OFF, in which case host depletion is performed directly on the raw / quality-processed reads. Several host sequences are provided by default.
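Conceptually, host depletion reduces to keeping the reads that fail to map against the host genome. A minimal sketch using pysam (file names hypothetical; the actual mapper and parameters are those configured in the pipeline):

```python
import pysam  # assumes reads were first mapped to the host genome (e.g., sample.host.bam)

def host_deplete(bam_path, fastq_out):
    """Write reads that did NOT map to the host reference back to FASTQ."""
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(fastq_out, "w") as out:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:  # mapped reads are treated as host and dropped
                quals = "".join(chr(q + 33) for q in read.query_qualities)
                out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{quals}\n")

host_deplete("sample.host.bam", "sample.depleted.fastq")
```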
De novo assembly: This step performs de novo assembly using the reads retained after the "Viral enrichment" and/or "Host depletion" steps. If both of those steps were turned OFF, assembly is performed directly on the raw / quality-processed reads. Assembled contigs are automatically filtered for a minimum sequence length.
Identification of viral sequences: This step screens reads and contigs against viral sequence databases, generating an intermediate read and/or contig classification report: a list of viral hits (taxonomic identifiers - TAXID, and representative accession identifiers - ACCID) potentially present in the sample. TAXIDs bearing the keyword “phage” in their scientific name are filtered out.
Selection of viral TAXIDs and representative genome sequences for confirmatory analysis: In this step, the previously identified viral hits (TAXIDs) are selected for confirmatory mapping against reference viral genome(s) present in the available databases. Viral TAXIDs are selected, up to a maximum number of hits, in the following order: i) viral hits corresponding to phages are removed from the classification reports; ii) TAXIDs present in both intermediate classification reports (reads and contigs) are selected; iii) additional TAXIDs are selected across the read and contig classification reports by number of hits, in decreasing order, and total length of matching sequences, when available, until reaching the user-defined maximum number of hits (a schematic sketch of this logic is given below). Finally, the selected TAXIDs are queried against the available databases for associated ACCIDs.
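The selection logic can be summarized with the following schematic Python sketch (data structures and function names are hypothetical simplifications of the pipeline's internals):

```python
def select_taxids(read_hits, contig_hits, max_hits):
    """Select TAXIDs for confirmatory mapping.

    read_hits / contig_hits: dicts mapping TAXID -> hit count,
    with phage hits already removed (step i).
    """
    # step ii: TAXIDs supported by both reads and contigs are selected first
    selected = [taxid for taxid in contig_hits if taxid in read_hits]
    # step iii: fill remaining slots from either report, by decreasing hit count
    candidates = sorted(
        ((t, n) for t, n in {**read_hits, **contig_hits}.items() if t not in selected),
        key=lambda item: item[1],
        reverse=True,
    )
    for taxid, _count in candidates:
        if len(selected) >= max_hits:
            break
        selected.append(taxid)
    return selected[:max_hits]
```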
Remapping of the viral sequences against selected reference genomes: This step maps reads and contigs against representative genome sequences (ACCIDs) of the viral TAXIDs selected in the previous step. Reads are also mapped against the set of contigs classified for each TAXID. Of note, TAXIDs that were not automatically selected for this confirmatory remapping step (but that were present in the intermediate read and/or contig classification reports) can still be user-selected for mapping at any time. An optional, extra layer of “mapping stringency” was added to this step to minimize false positive hits, allowing users to set a maximum sum of mismatch base qualities before a read is marked unmapped, as well as a maximum fraction of nucleotide mismatches allowed before soft-clipping read ends. This additional layer is disabled by default.
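The first stringency criterion can be illustrated with a pysam sketch that discards a read once the base qualities at its mismatched positions sum above a threshold (threshold value illustrative; requires the MD tag in the alignment):

```python
import pysam

def passes_stringency(read, max_mismatch_qualsum=60):
    """Keep a read only if the summed base qualities at mismatches stay low."""
    if read.is_unmapped:
        return False
    qualsum = 0
    # with_seq=True reports the reference base, lowercase at mismatched positions
    for qpos, _rpos, ref_base in read.get_aligned_pairs(matches_only=True, with_seq=True):
        if ref_base.islower():
            qualsum += read.query_qualities[qpos]
    return qualsum <= max_mismatch_qualsum
```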
Reporting: The workflow culminates in user-oriented reports listing the top suspected viruses (detailed in the Usage section), each accompanied by several diagnostic-oriented metrics, statistics and visualizations, provided as (interactive) tables (intermediate and final reports), graphs (e.g., coverage plots, Integrative Genomics Viewer visualization, assembly-to-reference dotplots) and multiple downloadable output files (e.g., list of software parameters, read/contig classification reports, mapped reads/contigs identified per virus, reference sequences, etc.). To further help the user assess the validity of the reported hits in a given sample, viral references are grouped by mapping overlap, as measured by the number of shared mapped reads. This grouping tends to place true positive hits together with their cross-mapped potential false positives, allowing for easy identification of the latter. Grouping parameters are modifiable in the Reporting section of the TELEVIR Settings Menu for both sequencing technologies.
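As an illustration, grouping by shared mapped reads can be expressed as finding connected components among references whose mapped-read sets overlap (the overlap criterion and threshold below are illustrative, not the platform defaults):

```python
from itertools import combinations

def group_references(mapped_reads, min_shared=0.5):
    """Group reference ACCIDs whose mapped-read sets overlap.

    mapped_reads: dict ACCID -> set of read names mapped to that reference.
    Two references are linked when the smaller set shares at least
    min_shared of its reads with the other; groups are connected components.
    """
    parent = {acc: acc for acc in mapped_reads}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(mapped_reads, 2):
        shared = len(mapped_reads[a] & mapped_reads[b])
        smaller = min(len(mapped_reads[a]), len(mapped_reads[b]))
        if smaller and shared / smaller >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for acc in mapped_reads:
        groups.setdefault(find(acc), []).append(acc)
    return list(groups.values())
```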
In the context of metagenomics in clinical virology, cross-mapping of reads across several host and pathogen reference sequences is very common, resulting in a high false positive rate [28–30]. The TELEVIR workflow provides a lightweight but robust approach to false positives, both in classification and in interpretation. Firstly, it follows recommendations in the literature to filter out reads enriched in low-complexity regions (e.g., homopolymeric tracts or short repeats), as well as unwanted material (host or non-viral “contaminants”), through host depletion and / or viral enrichment [21,26,31]. These steps aim at decreasing background noise and increasing the performance [26,32,33] and speed of both read classifiers and assemblers. In turn, besides read classification, the workflow takes advantage of the increased precision of contig classification [34], which provides an additional, robust metric with which to assess the validity of the final results. The pipeline further innovates in tackling false positives by introducing a final confirmatory analysis that comprises automatic reference selection, remapping (including optional “mapping stringency” settings), hit grouping and warning flagging. By normalizing the classification reports and outputs for comparison, the interactive reports provide a uniform basis on which to confirm classifications and weed out false positives.
TELEVIR benchmarking
In order to identify the best approaches to be implemented in the TELEVIR toolbox, we benchmarked several workflows. For this, we tested combinations of the key modules that comprise the overall virus identification pipeline: Viral Enrichment, Assembly and Classification (of reads and contigs). Specifically, we tested software (such as Centrifuge, DIAMOND, Kaiju, Kraken2, KrakenUniq, BLAST) (Supplementary Table 1) and databases (such as NCBI, RVDB, UniProt, Virosaurus) (Supplementary Table 2) commonly used in virological diagnostic laboratories for clinical metagenomics, as well as some more recent but promising alternative classifiers (deSAMBA, FastViromeExplorer, CLARK) (see more details in Additional file 1 - Resources). In some instances, we also evaluated software performance by varying argument values (Supplementary Table 3). We further benchmarked the sorting algorithm used to rank candidate reference hits based on the results of the Read and Contig classification steps. For ONT data, we ran 117 combinations (i.e., different software, reference databases and/or parameters) on 20 samples (a total of 2340 runs). For Illumina data, we ran 108 combinations on 24 samples (2592 total runs). The reads used in the benchmark (Supplementary Table 4) covered a wide range of viruses (including influenza and SARS-CoV-2, but also bluetongue and epizootic hemorrhagic disease virus, among others) and hosts (including human, various ungulates and one Culicoides specimen), and include a dataset of clinical samples from patients with encephalitis or viral respiratory infections, previously used to benchmark software for metagenomic virus diagnostics [24].
The design, methodological details and results of this extensive benchmark are described in Additional file 1 (covering Supplementary Tables 1-7 and Supplementary Figures 1-5). In summary, we found that combining the information from contig and read classification to rank metagenomics candidate hits is indeed preferable to depending on a single classification source (Supplementary Figure 1). Regarding software selection: at the Viral Enrichment step, Kraken2 and Centrifuge performed best for Illumina and ONT technologies, respectively, based on precision (Supplementary Figure 3 A-B); at the Contig classification step, we found that nucleotide BLAST resulted in the highest number of successfully mapped contigs (Supplementary Figure 3 C-D); at the Read classification step, we found significant differences in precision among several software tools for ONT, but not for Illumina (Supplementary Figure 3 E-F). In the end, our choice of software (Supplementary Table 8) reflected a trade-off between benchmark results at the module level (Supplementary Figure 3) and in combination (Supplementary Figure 4), providing the user with adequate cross-validation, and the constraints of implementing new software on an existing platform (see Additional file 1 - Section 4.1).
findONTime, a complementary tool to enable real-time metagenomic virus detection
When performing hypothesis-free viral diagnosis by sequencing complex biological samples, the proportion of virus in a sample is unknown. As such, the amount of sequencing data, and, consequently, the run length needed to accurately detect a virus, cannot be predicted a priori. As a result, sequencing runs are often left to proceed overnight, at the expense of material and, in the context of diagnostics, to the potential detriment of the patient or animal. Inspired by existing examples in the field of real-time ONT targeted mapping and genome coverage monitoring (e.g., RAMPART; https://artic.network/rampart), we envisaged a tool for continuous ONT run monitoring in the context of viral metagenomics that would allow users to cut a sequencing run short once sufficient pathogen sequence evidence has been gathered. As such, we developed findONTime (https://github.com/INSaFLU/findONTime), a command-line tool complementary to the INSaFLU-TELEVIR platform that enables time- and cost-effective real-time viral metagenomic detection. findONTime is a multi-threaded Python package that runs concurrently with MinION sequencing to: i) monitor the demultiplexed FASTQ files (gzipped or not) that are being generated in real time for each sample (the sequencing run should have the barcoding option ON); ii) merge the same-sample files (at user-defined time intervals), downsize them (on demand) and prepare a metadata table (according to the INSaFLU-TELEVIR template); iii) if requested, upload these files (ONT reads and metadata) to the INSaFLU-TELEVIR platform (local server via SSH or directly through docker, depending on a user-provided configuration file); and, iv) launch the metagenomics virus detection analysis using the TELEVIR module. findONTime is implemented in Python 3.9 and is pip-installable (https://pypi.org/project/findontime/).
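The core monitoring loop can be sketched as follows (a simplified illustration; the real findONTime additionally downsizes files, prepares the INSaFLU metadata table and handles upload and analysis launching):

```python
import glob
import gzip
import time
from pathlib import Path

def monitor_run(fastq_dir, merged_dir, interval_s=600, stop_flag=Path("STOP")):
    """Periodically merge the per-barcode FASTQ files produced by a MinION run."""
    Path(merged_dir).mkdir(exist_ok=True)
    while not stop_flag.exists():  # e.g., created by the user to end monitoring
        for barcode_dir in sorted(glob.glob(f"{fastq_dir}/barcode*")):
            barcode = Path(barcode_dir).name
            merged_path = Path(merged_dir) / f"{barcode}.fastq.gz"
            with gzip.open(merged_path, "wt") as merged:
                for fq in sorted(glob.glob(f"{barcode_dir}/*.fastq.gz")):
                    with gzip.open(fq, "rt") as fh:
                        merged.write(fh.read())
        # here the real tool would downsize, build metadata and trigger TELEVIR
        time.sleep(interval_s)
```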
Routine genomic surveillance
- Reference-based genome assembly
With the recent advances in third-generation sequencing technologies (ONT) and their wider access through more portable and affordable equipment (MinION), it became necessary to deploy a reference-based genome assembly pipeline for ONT data analysis in the INSaFLU-TELEVIR platform, in addition to the existing workflow for Illumina and Ion Torrent data [19]. In order to keep the same dashboard usability across technologies (see Usage section), the implemented ONT workflow followed the same pipeline architecture (from raw reads to quality analysis, reference-based generation/curation of consensus sequences and mutation detection) and input/output flow and formats [19], while relying on open-source software specifically adapted to the characteristics of ONT data. First, the sequencing technology (ONT or Illumina/Ion Torrent) is automatically inferred from the distribution of read lengths upon read upload. Samples classified as ONT are QC filtered using NanoFilt [35], with statistics and reports generated using NanoStat [35] and RabbitQC [36]. Default parameters for NanoFilt, regarding average read quality, minimum/maximum read length and start/end trimming size (Supplementary Table 8), were selected to provide a trade-off between quality and read length, but can be configured by the user to fit sample characteristics and upstream experimental conditions. Post-QC reads are then processed by medaka (https://github.com/nanoporetech/medaka) using the “consensus” and “variant” modes to generate raw consensus sequences (FASTA) and mutation lists (VCF), respectively. After calculating the depth of coverage per site, mutations present in the raw VCF files are filtered based on user-configurable criteria: i) minimum depth of coverage per site (--mincov) (default: 30); and, ii) minimum proportion for variant evidence (--minfrac) (default: 0.8). Intermediate consensus sequences are then generated using bcftools [37] based on the VCF file containing the validated mutations. The last step of consensus sequence curation involves the automatic placement of undefined nucleotides (“N”) in: i) low coverage regions (i.e., regions with coverage below --mincov), using msa_masker (https://github.com/rfm-targa/BioinfUtils/blob/master/FASTA/msa_masker.py); ii) mutations with frequencies between 50% and the user-defined --minfrac; and, iii) regions (or sites) selected to be masked by the user (e.g., regions falling outside the amplicon scheme). Steps i) and iii) of consensus curation were also incorporated in the existing Illumina/Ion Torrent workflow [19], which shares all subsequent steps of mutation annotation (using snpEff) [38], alignment (using MAUVE and MAFFT) [39,40] and rapid phylogenetics (using FastTree) [41], as previously described [19].
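Steps i) and iii) of the consensus curation reduce to a simple masking operation, sketched below (a minimal illustration; the platform applies it through msa_masker on the aligned consensus):

```python
def mask_consensus(consensus, depths, mincov=30, user_mask=()):
    """Place 'N' at low-coverage sites and at user-selected positions.

    consensus: consensus sequence (string); depths: per-site depth of coverage;
    user_mask: 0-based positions to mask (e.g., outside the amplicon scheme).
    """
    to_mask = set(user_mask)
    return "".join(
        "N" if depth < mincov or pos in to_mask else base
        for pos, (base, depth) in enumerate(zip(consensus, depths))
    )

masked = mask_consensus("ACGTACGT", [50, 50, 10, 50, 50, 50, 50, 50], user_mask=[7])
# -> "ACNTACGN": position 2 falls below mincov, position 7 was user-selected
```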
INSaFLU benchmarking
The INSaFLU reference-based genome assembly pipeline for Illumina data analysis was previously benchmarked for influenza virus [19] using the IRMA pipeline [42] for comparison. In the present study, we performed additional benchmarking for SARS-CoV-2, comparing INSaFLU with a commonly used command-line bioinformatics workflow (https://github.com/andersen-lab/HCoV-19-Genomics), involving BWA for read mapping [43] and iVar (https://github.com/andersen-lab/ivar; https://andersen-lab.github.io/ivar/html/manualpage.html) for QC and consensus generation [44]. The newly implemented INSaFLU pipeline for ONT data was also benchmarked against the widely used ARTIC SARS-CoV-2 pipeline (https://github.com/artic-network/fieldbioinformatics/). The methodological details and results of the two benchmarks are described in Additional file 2. In summary, for Illumina, the comparative analysis of the consensus sequences obtained with INSaFLU versus the BWA/iVar workflow demonstrated similar performance by both pipelines, but underlined the expected added value of incorporating an extra step of targeted primer clipping from the BAM file, as implemented in iVar (Additional file 2). In this context, the iVar primer clipping step (including primer trimming from aligned reads, as well as removal of aligned reads containing minor variants matching the primer sequence but differing from the consensus sequence) was incorporated in both the Illumina and ONT pipelines. The primer scheme (used for amplification) is selected in the dashboard by the user upon Project creation, before adding samples. The final benchmark results using the upgraded INSaFLU workflows confirmed that INSaFLU consensus generation performs similarly to widely used Illumina and ONT pipelines (detailed discussion in Additional file 2).
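For reference, iVar primer clipping operates on the coordinate-sorted BAM together with the primer scheme in BED format; a hedged invocation sketch (file names hypothetical):

```python
import subprocess

# iVar soft-clips primer sequences from aligned reads; -i input BAM,
# -b primer scheme (BED), -p output prefix (file names hypothetical)
subprocess.run(
    ["ivar", "trim",
     "-i", "sample.sorted.bam",
     "-b", "primer_scheme.bed",
     "-p", "sample.primertrimmed"],
    check=True,
)
```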
- Additional implementation of surveillance-oriented functionalities
In parallel to the refinement of the reference-based genome assembly pipelines, we integrated other important surveillance-oriented (often virus-specific) functionalities and features into the platform. The “References” default database (open to all users) was continuously enriched with genome sequences relevant for the surveillance of viruses other than influenza, namely SARS-CoV-2, RSV and MPXV. The step of rapid virus identification/classification upon read upload, as originally described [19], was also strengthened by accommodating ONT data, through draft assembly using Raven [45], and by upgrading the database of genetic markers to rapidly identify the presence of Human Betacoronavirus (namely, HCoV-OC43, HCoV-HKU1, MERS, SARS, and SARS-CoV-2), RSV A and B, as well as MPXV. Pangolin (https://github.com/cov-lineages/pangolin) [46] was incorporated for SARS-CoV-2 Pango lineage classification (default settings, usher mode), with the Pangolin software and databases automatically updated on a daily basis to provide up-to-date (re-)classification of new and old sequences. As a complement, direct hyperlinks to Nextclade (https://clades.nextstrain.org/) are automatically provided for rapid and flexible clade/lineage classification and quality analysis of SARS-CoV-2, seasonal influenza, MPXV and RSV consensus sequences (INSaFLU consensus sequences are analyzed directly client-side in the browser). Other main improvements of the surveillance component involved the incorporation of Nextstrain (https://nextstrain.org/) [47], and the development and integration of algn2pheno (https://github.com/insapathogenomics/algn2pheno), as described in the next sections.
Integration of Nextstrain phylogeographic and temporal analysis
Nextstrain (https://nextstrain.org/) [47] relies on reproducible, open-source, pathogen-specific workflows for genomic data curation, analysis and visualization of integrated phylogeographic and temporal data, towards real-time tracking of pathogen evolution. Hence, we strengthened the INSaFLU-TELEVIR surveillance component by integrating Nextstrain phylogeographic and temporal analysis of several viruses, namely SARS-CoV-2, seasonal and avian influenza, MPXV and RSV. We performed minor adjustments to the original Nextstrain workflows in order to: i) change the input source so that the implemented workflow incorporates user-provided sequences (via INSaFLU or by direct user upload) instead of fetching from databases; ii) relax some of the sequence filtering (more important when fetching from external databases) to maximize the number of input sequences included in the final tree (most often consensus sequences from INSaFLU projects that already passed user-defined quality filters); and, iii) reduce input/output complexity, by removing some pathogen-specific inferences that require metadata that may not always be available (e.g., clinical onset date). For example, in the specific case of seasonal influenza, we removed fitness inferences and allowed more ambiguity and divergence among input sequences. Moreover, users may want to analyze organisms for which there is no specific Nextstrain build available. To cover these cases, we included a generic build (with or without temporal data) which performs alignment and tree-building using augur [48], either with the user-provided reference as root (when temporal data is not provided) or inferring the root and molecular clock from user-provided temporal sample metadata. The generic builds and the INSaFLU-adapted species-specific Nextstrain workflows are kept in a separate repository available at https://github.com/INSaFLU/nextstrain_builds. We regularly update our workflows with changes made to the repositories of the original workflows, e.g., with information regarding clades.
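A minimal sketch of such a generic augur build (illustrative file names; the flags follow augur's documented CLI, but this is not the INSaFLU build itself):

```python
import subprocess

def generic_build(sequences, reference, metadata=None):
    """Align, build a tree, and (optionally) infer a time tree with augur."""
    subprocess.run(["augur", "align", "--sequences", sequences,
                    "--reference-sequence", reference,
                    "--output", "aligned.fasta"], check=True)
    subprocess.run(["augur", "tree", "--alignment", "aligned.fasta",
                    "--output", "tree_raw.nwk"], check=True)
    refine = ["augur", "refine", "--tree", "tree_raw.nwk",
              "--alignment", "aligned.fasta",
              "--output-tree", "tree.nwk",
              "--output-node-data", "branch_lengths.json"]
    if metadata:  # infer root and molecular clock from sampling dates
        refine += ["--metadata", metadata, "--timetree"]
    subprocess.run(refine, check=True)
```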
algn2pheno: screening of potential genotype-phenotype associations
In the course of the evolution of any given pathogen, multiple mutations and combinations of mutations are continuously arising. Although most of these mutations have no effect or relevance, it is sometimes possible to associate certain mutations with specific phenotypes or characteristics (such as antiviral resistance, resistance to neutralizing antibodies, enhanced affinity to host receptors or enhanced transmissibility), when rich epidemiological, clinical and/or biological data are available. In this sense, in the context of viral genomics surveillance, it is crucial to be able to rapidly detect and report such known mutations of interest in genomic sequences. As such, we developed algn2pheno (https://github.com/insapathogenomics/algn2pheno), a tool that screens an amino acid or nucleotide alignment against a given "genotype-phenotype" database. algn2pheno is implemented in the routine genomic surveillance module of INSaFLU and automatically screens the SARS-CoV-2 Spike amino acid alignments of each SARS-CoV-2 project against three default “genotype-phenotype” databases: the COG-UK Antigenic mutations (https://sars2.cvr.gla.ac.uk/cog-uk/), the Pokay Database (https://github.com/nodrogluap/pokay/tree/master/data) and a database of mutations in Spike epitope residues compiled by Carabelli and colleagues [49]. algn2pheno detects all the mutations in each sequence, maps them to the three databases and generates final reports with the repertoire of mutations of interest present in each sequence and their potential linkage to specific phenotypes.
This tool is also available as a standalone command-line tool (https://github.com/insapathogenomics/algn2pheno) and was designed for flexibility and adaptation to different pathogens and customized databases. Users must provide an amino acid / nucleotide alignment including the sequences under analysis (and the reference sequence, as mutation numbering will refer to this sequence) and a “genotype-phenotype” database in either tab-separated (.TSV) or Excel (.XLSX) format. algn2pheno will generate two main outputs (among other useful intermediate files): i) a final report (tabular format, with a row per sequence) that lists all the “flagged mutations” (i.e., mutations in the database that were identified in the sequences), the phenotypes associated with those mutations and a list of all the mutations in each sequence; and, ii) a binary matrix with the mutations and the "associated" phenotypes identified in all sequences. algn2pheno is implemented in Python 3.9 and is freely available at https://github.com/insapathogenomics/algn2pheno (including usage examples and detailed output descriptions).
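The core screening logic can be sketched as follows (a schematic simplification, not the tool's actual implementation; mutation notation illustrative):

```python
def screen_alignment(sample_muts, ref_db):
    """Cross-reference per-sample mutation lists with a genotype-phenotype database.

    sample_muts: dict sample -> set of mutations (e.g., {"S:E484K", ...});
    ref_db: dict mutation -> set of associated phenotypes.
    """
    report = {}
    for sample, muts in sample_muts.items():
        flagged = {m: sorted(ref_db[m]) for m in muts if m in ref_db}
        report[sample] = {
            "flagged mutations": sorted(flagged),
            "phenotypes": sorted({p for phenos in flagged.values() for p in phenos}),
            "all mutations": sorted(muts),
        }
    return report

db = {"S:E484K": {"reduced antibody neutralization"}}
report = screen_alignment({"sampleA": {"S:E484K", "S:D614G"}}, db)
# report["sampleA"]["flagged mutations"] -> ["S:E484K"]
```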
Installation and software availability
Easier installation using docker
Although the INSaFLU-TELEVIR website is freely available for public use, it might become limiting when robust high-speed internet is not easily available (e.g., for use in the field), when high volumes of data are uploaded (subject to queue and computational constraints), or when other limitations (e.g., institutional, legal or ethical) prevent the upload of sequence and/or descriptive data to external servers. It is thus essential that the INSaFLU-TELEVIR platform also be made available locally. Although fully based on open-source software, the platform depends on many different tools, making it relatively complex to install and configure manually. To facilitate the local installation of INSaFLU-TELEVIR, we used the docker containerization system [50] to automate the installation process, making it possible for users with limited informatics knowledge to install the system. Users just need to install docker on their system, download the INSaFLU docker from GitHub (https://github.com/INSaFLU/docker) and run a small number of commands. During this process, users can also decide not to install the virus detection module (TELEVIR) if they only need the routine genomic surveillance components (INSaFLU and Nextstrain). Although designed to be installed on a computer running a Unix-like operating system, the docker installation can also be used on a Windows platform (e.g., a laptop) using the Windows Subsystem for Linux (WSL), as long as sufficient computational resources are available. The minimum recommended RAM is 32 GB if the virus detection module is installed, but can go down to 16 GB if only the genomic surveillance system is installed. One recommended use case is the installation (e.g., by a (bio)informatician) of an INSaFLU-TELEVIR instance to be shared within an institution. Also, when a local docker instance is available, findONTime (see previous sections) can be used to upload reads to the local instance and to automatically create and run viral detection projects, reducing hands-on time.
A Snakemake workflow to facilitate execution in a compute cluster
The main driver for the development of the INSaFLU-TELEVIR website was the empowerment of laboratories with limited bioinformatics capacity, facilitating the implementation and usage of bioinformatics workflows for viral metagenomic diagnostics and routine genomic surveillance. Nonetheless, it may become cumbersome to use the INSaFLU web-based interface with a very large number of samples. Also, some laboratories may want to use the analysis pipelines available in INSaFLU through internal computational infrastructures, such as compute clusters. Moreover, the website is not a practical testbed for the development of new functionalities or the integration of alternative approaches into the existing pipelines. To cater for these needs, we have implemented the functionality of the genomic surveillance module of INSaFLU as a Snakemake [51] workflow. We make use of Snakemake’s support for conda to facilitate the installation of external software, and of its SLURM support to facilitate execution on compute clusters. The workflow is available at https://github.com/INSaFLU/insaflu_snakemake, including instructions on how to use it. Using the benchmark datasets described above, the INSaFLU Snakemake workflow produced the same consensus sequences as the public website (Additional file 2).
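As an illustration of the pattern used, a hypothetical Snakemake rule combining a per-rule conda environment with an ordinary shell step (rule name, files and script are illustrative, not taken from insaflu_snakemake):

```python
# Hypothetical Snakefile fragment (Snakemake rules are a Python-based DSL)
rule consensus:
    input:
        reads="samples/{sample}.fastq.gz",
        reference="reference.fasta",
    output:
        "results/{sample}/consensus.fasta",
    conda:
        "envs/assembly.yaml"  # built automatically when run with --use-conda
    shell:
        "generate_consensus.sh {input.reads} {input.reference} {output}"
```

On a cluster, such a workflow can then be launched with, e.g., snakemake --use-conda --profile slurm, given a suitably configured SLURM profile.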