Viral metagenomic detection
Workflow overview and rationale
One of the main developments since INSaFLU’s first release [19] focused on upgrading the platform for automated metagenomic virus identification, in order to support both human and veterinary clinical practice and disease outbreak investigations. After reviewing the current state-of-the-art in bioinformatics pipelines for metagenomic virus diagnostics [18,21–26] and consulting the TELEVIR consortium (Public Health and Veterinary institutes across Europe), a modular pipeline was designed and developed, incorporating the key steps of NGS metagenomics taxonomic classification and reporting (Figure 1 and Figure 2), namely: read quality control, viral enrichment / host depletion, de novo assembly, read/contig taxonomic classification, confirmatory reference-based remapping and reporting. The choice of the internal components of the implemented workflows (software, default parameters, etc.) resulted from extensive benchmarking (see next section). TELEVIR resources, benchmarking and implementation are detailed in Additional file 1. In summary, the input/output flow and main functionalities behind the main TELEVIR steps are as follows:
Read quality analysis and improvement: This step takes the input single- or paired-end reads (fastq.gz format; Illumina, Ion Torrent or ONT) and produces quality-processed reads, as well as quality control reports for each file, before and after this step. This step is performed automatically following sample upload and is thus shared by the two components (virus detection and genomic surveillance) of the INSaFLU-TELEVIR platform. Quality filtering and trimming of Illumina reads is performed as described in Borges et al. (2018) [19]; treatment of ONT data is described below. An optional, extra filtering layer that targets low-complexity reads is available as part of the TELEVIR pipeline using the software PRINSEQ [27]. Parameters are modifiable by the user.
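As an illustration of the kind of low-complexity screening this step performs, the minimal Python sketch below filters FASTQ reads by k-mer entropy (a DUST-like criterion; the threshold and k value are illustrative and not necessarily the PRINSEQ defaults):

```python
import gzip
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy over overlapping k-mers, scaled to 0-100."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(min(total, 4 ** k))
    return 100 * entropy / max_entropy if max_entropy > 0 else 0.0

def filter_low_complexity(fastq_in, fastq_out, min_entropy=70.0):
    """Keep only reads whose k-mer entropy is above the threshold."""
    with gzip.open(fastq_in, "rt") as fin, gzip.open(fastq_out, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if kmer_entropy(record[1].strip()) >= min_entropy:
                fout.writelines(record)
```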
Viral enrichment: This step retains potential viral reads based on a rapid and permissive classification of the reads against a viral sequence database. It is performed directly on raw reads (if QC was turned OFF) or on quality-processed reads (if QC was turned ON).
Host depletion: This step removes potential host reads by reference-based mapping against host genome sequence(s); mapped reads are treated as potential host reads and removed. It acts on virus-enriched reads, unless the viral enrichment step was turned OFF, in which case host depletion is performed directly on the raw / quality-processed reads. Several host sequences are provided by default.
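Conceptually, host depletion reduces to keeping the reads that fail to map against the host genome. A minimal sketch using pysam (file names hypothetical; the actual mapper and parameters are those configured in the pipeline):

```python
import pysam  # assumes reads were first mapped to the host genome (e.g., sample.host.bam)

def host_deplete(bam_path, fastq_out):
    """Write reads that did NOT map to the host reference back to FASTQ."""
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(fastq_out, "w") as out:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:  # mapped reads are treated as host and dropped
                quals = "".join(chr(q + 33) for q in read.query_qualities)
                out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{quals}\n")

host_deplete("sample.host.bam", "sample.depleted.fastq")
```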
De novo assembly: This step performs de novo assembly using the reads retained after the "Viral enrichment" and/or "Host depletion" steps. If both of those steps were turned OFF, assembly is performed directly on the raw / quality-processed reads. Assembled contigs are automatically filtered for a minimum sequence length.
Identification of viral sequences: This step screens reads and contigs against viral sequence databases, generating an intermediate read and/or contig classification report: a list of viral hits (taxonomic identifiers - TAXID, and representative accession identifiers - ACCID) potentially present in the sample. TAXIDs bearing the keyword “phage” in their scientific name are filtered out.
Selection of viral TAXIDs and representative genome sequences for confirmatory analysis: In this step, the previously identified viral hits (TAXIDs) are selected for confirmatory mapping against reference viral genome(s) present in the available databases. Viral TAXIDs are selected, up to a maximum number of hits, in the following order: i) viral hits corresponding to phages are removed from the classification reports; ii) TAXIDs present in both intermediate classification reports (reads and contigs) are selected; iii) additional TAXIDs are selected across the read and contig classification reports by number of hits, in decreasing order, and total length of matching sequences, when available, until reaching the user-defined maximum number of hits (a schematic sketch of this logic is given below). Finally, the selected TAXIDs are queried against the available databases for associated ACCIDs.
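The selection logic can be summarized with the following schematic Python sketch (data structures and function names are hypothetical simplifications of the pipeline's internals):

```python
def select_taxids(read_hits, contig_hits, max_hits):
    """Select TAXIDs for confirmatory mapping.

    read_hits / contig_hits: dicts mapping TAXID -> hit count,
    with phage hits already removed (step i).
    """
    # step ii: TAXIDs supported by both reads and contigs are selected first
    selected = [taxid for taxid in contig_hits if taxid in read_hits]
    # step iii: fill remaining slots from either report, by decreasing hit count
    candidates = sorted(
        ((t, n) for t, n in {**read_hits, **contig_hits}.items() if t not in selected),
        key=lambda item: item[1],
        reverse=True,
    )
    for taxid, _count in candidates:
        if len(selected) >= max_hits:
            break
        selected.append(taxid)
    return selected[:max_hits]
```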
Remapping of the viral sequences against selected reference genomes: This step maps reads and contigs against representative genome sequences (ACCIDs) of the viral TAXIDs selected in the previous step. Reads are also mapped against the set of contigs classified for each TAXID. Of note, TAXIDs that were not automatically selected for this confirmatory remapping step (but that were present in the intermediate read and/or contig classification reports) can still be user-selected for mapping at any time. An optional, extra layer of “mapping stringency” was added to this step to minimize false positive hits, allowing users to set a maximum sum of mismatch base qualities before a read is marked unmapped, as well as a maximum fraction of nucleotide mismatches allowed before soft-clipping read ends. This additional layer is disabled by default.
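The first stringency criterion can be illustrated with a pysam sketch that discards a read once the base qualities at its mismatched positions sum above a threshold (threshold value illustrative; requires the MD tag in the alignment):

```python
import pysam

def passes_stringency(read, max_mismatch_qualsum=60):
    """Keep a read only if the summed base qualities at mismatches stay low."""
    if read.is_unmapped:
        return False
    qualsum = 0
    # with_seq=True reports the reference base, lowercase at mismatched positions
    for qpos, _rpos, ref_base in read.get_aligned_pairs(matches_only=True, with_seq=True):
        if ref_base.islower():
            qualsum += read.query_qualities[qpos]
    return qualsum <= max_mismatch_qualsum
```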
Reporting: The workflow culminates in user-oriented reports listing the top suspected viruses (detailed in the Usage section), each accompanied by several diagnostic-oriented metrics, statistics and visualizations, provided as (interactive) tables (intermediate and final reports), graphs (e.g., coverage plots, Integrative Genomics Viewer visualization, assembly-to-reference dotplots) and multiple downloadable output files (e.g., list of software parameters, read/contig classification reports, mapped reads/contigs identified per virus, reference sequences, etc.). To further help the user assess the validity of the reported hits in a given sample, viral references are grouped by mapping overlap, as measured by the number of shared mapped reads. This grouping tends to place true positive hits together with their cross-mapped potential false positives, allowing for easy identification of the latter. Grouping parameters are modifiable in the Reporting section of the TELEVIR Settings Menu for both sequencing technologies.
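As an illustration, grouping by shared mapped reads can be expressed as finding connected components among references whose mapped-read sets overlap (the overlap criterion and threshold below are illustrative, not the platform defaults):

```python
from itertools import combinations

def group_references(mapped_reads, min_shared=0.5):
    """Group reference ACCIDs whose mapped-read sets overlap.

    mapped_reads: dict ACCID -> set of read names mapped to that reference.
    Two references are linked when the smaller set shares at least
    min_shared of its reads with the other; groups are connected components.
    """
    parent = {acc: acc for acc in mapped_reads}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(mapped_reads, 2):
        shared = len(mapped_reads[a] & mapped_reads[b])
        smaller = min(len(mapped_reads[a]), len(mapped_reads[b]))
        if smaller and shared / smaller >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for acc in mapped_reads:
        groups.setdefault(find(acc), []).append(acc)
    return list(groups.values())
```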
In the context of metagenomics in clinical virology, cross-mapping of reads across several host and pathogen reference sequences is very common, resulting in a high false positive rate [28–30]. The TELEVIR workflow provides a lightweight but robust approach to false positives, both in classification and in interpretation. Firstly, it follows recommendations in the literature to filter out reads enriched in low-complexity regions (e.g., homopolymeric tracts or short repeats), as well as unwanted material (host or non-viral “contaminants”), through host depletion and / or viral enrichment [21,26,31]. These steps aim at decreasing background noise and increasing the performance [26,32,33] and speed of both read classifiers and assemblers. In turn, besides read classification, the workflow takes advantage of the increased precision of contig classification [34], which provides an additional, robust metric with which to assess the validity of the final results. The pipeline further innovates in tackling false positives by introducing a final confirmatory analysis that comprises automatic reference selection, remapping (including optional “mapping stringency” settings), hit grouping and warning flagging. By normalizing the classification reports and outputs for comparison, the interactive reports provide a uniform basis on which to confirm classifications and weed out false positives.
TELEVIR benchmarking
In order to identify the best approaches to be implemented in the TELEVIR toolbox, we benchmarked several workflows. For this, we tested combinations of the key modules that comprise the overall virus identification pipeline: Viral Enrichment, Assembly and Classification (of reads and contigs). Specifically, we tested software (such as Centrifuge, DIAMOND, Kaiju, Kraken2, KrakenUniq, BLAST) (Supplementary Table 1) and databases (such as NCBI, RVDB, UniProt, Virosaurus) (Supplementary Table 2) commonly used in virological diagnostic laboratories for clinical metagenomics, as well as some more recent but promising alternative classifiers (deSAMBA, FastViromeExplorer, CLARK) (see more details in Additional file 1 - Resources). In some instances, we also evaluated software performance by varying argument values (Supplementary Table 3). We further benchmarked the sorting algorithm used to rank candidate reference hits based on the results of the Read and Contig classification steps. For ONT data, we ran 117 combinations (i.e., different software, reference databases and/or parameters) on 20 samples (a total of 2340 runs). For Illumina data, we ran 108 combinations on 24 samples (2592 total runs). The reads used in the benchmark (Supplementary Table 4) covered a wide range of viruses (including influenza and SARS-CoV-2, but also bluetongue and epizootic hemorrhagic disease virus, among others) and hosts (including human, various ungulates and one Culicoides specimen), and include a dataset of clinical samples from patients with encephalitis or viral respiratory infections, previously used to benchmark software for metagenomic virus diagnostics [24].
The design, methodological details and results of this extensive benchmark are described in Additional file 1 (covering Supplementary Tables 1-7 and Supplementary Figures 1-5). In summary, we found that combining the information from contig and read classification to rank metagenomics candidate hits is indeed preferable to depending on a single classification source (Supplementary Figure 1). Regarding software selection: at the Viral Enrichment step, Kraken2 and Centrifuge performed best for Illumina and ONT technologies, respectively, based on precision (Supplementary Figure 3 A-B); at the Contig classification step, we found that nucleotide BLAST resulted in the highest number of successfully mapped contigs (Supplementary Figure 3 C-D); at the Read classification step, we found significant differences in precision among several software tools for ONT, but not for Illumina (Supplementary Figure 3 E-F). In the end, our choice of software (Supplementary Table 8) reflected a trade-off between benchmark results at the module level (Supplementary Figure 3) and in combination (Supplementary Figure 4), providing the user with adequate cross-validation, and the constraints of implementing new software on an existing platform (see Additional file 1 - Section 4.1).
findONTime, a complementary tool to enable real-time metagenomic virus detection
When performing hypothesis-free viral diagnosis by sequencing complex biological samples, the proportion of virus in a sample is unknown. As such, the amount of sequencing data, and, consequently, the run length needed to accurately detect a virus, cannot be predicted a priori. As a result, sequencing runs are often left to proceed overnight, at the expense of material and, in the context of diagnostics, to the potential detriment of the patient or animal. Inspired by existing examples in the field of real-time ONT targeted mapping and genome coverage monitoring (e.g., RAMPART; https://artic.network/rampart), we envisaged a tool for continuous ONT run monitoring in the context of viral metagenomics that would allow users to cut a sequencing run short once sufficient pathogen sequence evidence has been gathered. As such, we developed findONTime (https://github.com/INSaFLU/findONTime), a command-line tool complementary to the INSaFLU-TELEVIR platform that enables time- and cost-effective real-time viral metagenomic detection. findONTime is a multi-threaded Python package that runs concurrently with MinION sequencing to: i) monitor the demultiplexed FASTQ files (gzipped or not) that are being generated in real time for each sample (the sequencing run should have the barcoding option ON); ii) merge the same-sample files (at user-defined time intervals), downsize them (on demand) and prepare a metadata table (according to the INSaFLU-TELEVIR template); iii) if requested, upload these files (ONT reads and metadata) to the INSaFLU-TELEVIR platform (local server via SSH or directly through docker, depending on a user-provided configuration file); and, iv) launch the metagenomics virus detection analysis using the TELEVIR module. findONTime is implemented in Python 3.9 and is pip-installable (https://pypi.org/project/findontime/).
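The core monitoring loop can be sketched as follows (a simplified illustration; the real findONTime additionally downsizes files, prepares the INSaFLU metadata table and handles upload and analysis launching):

```python
import glob
import gzip
import time
from pathlib import Path

def monitor_run(fastq_dir, merged_dir, interval_s=600, stop_flag=Path("STOP")):
    """Periodically merge the per-barcode FASTQ files produced by a MinION run."""
    Path(merged_dir).mkdir(exist_ok=True)
    while not stop_flag.exists():  # e.g., created by the user to end monitoring
        for barcode_dir in sorted(glob.glob(f"{fastq_dir}/barcode*")):
            barcode = Path(barcode_dir).name
            merged_path = Path(merged_dir) / f"{barcode}.fastq.gz"
            with gzip.open(merged_path, "wt") as merged:
                for fq in sorted(glob.glob(f"{barcode_dir}/*.fastq.gz")):
                    with gzip.open(fq, "rt") as fh:
                        merged.write(fh.read())
        # here the real tool would downsize, build metadata and trigger TELEVIR
        time.sleep(interval_s)
```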
Routine genomic surveillance
- Reference-based genome assembly
With the recent advances in third-generation sequencing technologies (ONT) and their wider access through more portable and affordable equipment (MinION), it became necessary to deploy a reference-based genome assembly pipeline for ONT data analysis in the INSaFLU-TELEVIR platform, in addition to the existing workflow for Illumina and Ion Torrent data [19]. In order to keep the same dashboard usability across technologies (see Usage section), the implemented ONT workflow followed the same pipeline architecture (from raw reads to quality analysis, reference-based generation/curation of consensus sequences and mutation detection) and input/output flow and formats [19], while relying on open-source software specifically adapted to the characteristics of ONT data. First, the sequencing technology (ONT or Illumina/Ion Torrent) is automatically inferred from the distribution of read lengths upon read upload. Samples classified as ONT are QC filtered using NanoFilt [35], with statistics and reports generated using NanoStat [35] and RabbitQC [36]. Default parameters for NanoFilt, regarding average read quality, minimum/maximum read length and start/end trimming size (Supplementary Table 8), were selected to provide a trade-off between quality and read length, but can be configured by the user to fit sample characteristics and upstream experimental conditions. Post-QC reads are then processed by medaka (https://github.com/nanoporetech/medaka) using the “consensus” and “variant” modes to generate raw consensus sequences (FASTA) and mutation lists (VCF), respectively. After calculating the depth of coverage per site, mutations present in the raw VCF files are filtered based on user-configurable criteria: i) minimum depth of coverage per site (--mincov) (default: 30); and, ii) minimum proportion for variant evidence (--minfrac) (default: 0.8). Intermediate consensus sequences are then generated using bcftools [37] based on the VCF file containing the validated mutations. The last step of consensus sequence curation involves the automatic placement of undefined nucleotides (“N”) in: i) low coverage regions (i.e., regions with coverage below --mincov), using msa_masker (https://github.com/rfm-targa/BioinfUtils/blob/master/FASTA/msa_masker.py); ii) mutations with frequencies between 50% and the user-defined --minfrac; and, iii) regions (or sites) selected to be masked by the user (e.g., regions falling outside the amplicon scheme). Steps i) and iii) of consensus curation were also incorporated in the existing Illumina/Ion Torrent workflow [19], which shares all subsequent steps of mutation annotation (using snpEff) [38], alignment (using MAUVE and MAFFT) [39,40] and rapid phylogenetics (using FastTree) [41], as previously described [19].
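Steps i) and iii) of the consensus curation reduce to a simple masking operation, sketched below (a minimal illustration; the platform applies it through msa_masker on the aligned consensus):

```python
def mask_consensus(consensus, depths, mincov=30, user_mask=()):
    """Place 'N' at low-coverage sites and at user-selected positions.

    consensus: consensus sequence (string); depths: per-site depth of coverage;
    user_mask: 0-based positions to mask (e.g., outside the amplicon scheme).
    """
    to_mask = set(user_mask)
    return "".join(
        "N" if depth < mincov or pos in to_mask else base
        for pos, (base, depth) in enumerate(zip(consensus, depths))
    )

masked = mask_consensus("ACGTACGT", [50, 50, 10, 50, 50, 50, 50, 50], user_mask=[7])
# -> "ACNTACGN": position 2 falls below mincov, position 7 was user-selected
```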
INSaFLU benchmarking
The INSaFLU reference-based genome assembly pipeline for Illumina data analysis was previously benchmarked for influenza virus [19] using the IRMA pipeline [42] for comparison. In the present study, we performed additional benchmarking for SARS-CoV-2, comparing INSaFLU with a commonly used command-line bioinformatics workflow (https://github.com/andersen-lab/HCoV-19-Genomics), involving BWA for read mapping [43] and iVar (https://github.com/andersen-lab/ivar; https://andersen-lab.github.io/ivar/html/manualpage.html) for QC and consensus generation [44]. The newly implemented INSaFLU pipeline for ONT data was also benchmarked against the widely used ARTIC SARS-CoV-2 pipeline (https://github.com/artic-network/fieldbioinformatics/). The methodological details and results of the two benchmarks are described in Additional file 2. In summary, for Illumina, the comparative analysis of the consensus sequences obtained with INSaFLU versus the BWA/iVar workflow demonstrated similar performance by both pipelines, but underlined the expected added value of incorporating an extra step of targeted primer clipping from the BAM file, as implemented in iVar (Additional file 2). In this context, the iVar primer clipping step (including primer trimming from aligned reads, as well as removal of aligned reads containing minor variants matching the primer sequence but differing from the consensus sequence) was incorporated in both the Illumina and ONT pipelines. The primer scheme (used for amplification) is selected in the dashboard by the user upon Project creation, before adding samples. The final benchmark results using the upgraded INSaFLU workflows confirmed that INSaFLU consensus generation performs similarly to widely used Illumina and ONT pipelines (detailed discussion in Additional file 2).
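For reference, iVar primer clipping operates on the coordinate-sorted BAM together with the primer scheme in BED format; a hedged invocation sketch (file names hypothetical):

```python
import subprocess

# iVar soft-clips primer sequences from aligned reads; -i input BAM,
# -b primer scheme (BED), -p output prefix (file names hypothetical)
subprocess.run(
    ["ivar", "trim",
     "-i", "sample.sorted.bam",
     "-b", "primer_scheme.bed",
     "-p", "sample.primertrimmed"],
    check=True,
)
```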
- Additional implementation of surveillance-oriented functionalities
In parallel to the refinement of the reference-based genome assembly pipelines, we integrated other important surveillance-oriented (often virus-specific) functionalities and features into the platform. The “References” default database (open to all users) was continuously enriched with genome sequences relevant for the surveillance of viruses other than influenza, namely SARS-CoV-2, RSV and MPXV. The step of rapid virus identification/classification upon read upload, as originally described [19], was also strengthened by accommodating ONT data, through draft assembly using Raven [45], and by upgrading the database of genetic markers to rapidly identify the presence of Human Betacoronavirus (namely, HCoV-OC43, HCoV-HKU1, MERS, SARS, and SARS-CoV-2), RSV A and B, as well as MPXV. Pangolin (https://github.com/cov-lineages/pangolin) [46] was incorporated for SARS-CoV-2 Pango lineage classification (default settings, usher mode), with the Pangolin software and databases automatically updated on a daily basis to provide up-to-date (re-)classification of new and old sequences. As a complement, direct hyperlinks to Nextclade (https://clades.nextstrain.org/) are automatically provided for rapid and flexible clade/lineage classification and quality analysis of SARS-CoV-2, seasonal influenza, MPXV and RSV consensus sequences (INSaFLU consensus sequences are analyzed directly client-side in the browser). Other main improvements of the surveillance component involved the incorporation of Nextstrain (https://nextstrain.org/) [47], and the development and integration of algn2pheno (https://github.com/insapathogenomics/algn2pheno), as described in the next sections.
Integration of Nextstrain phylogeographic and temporal analysis
Nextstrain (https://nextstrain.org/) [47] relies on reproducible, open-source, pathogen-specific workflows for genomic data curation, analysis and visualization of integrated phylogeographic and temporal data, towards real-time tracking of pathogen evolution. Hence, we strengthened the INSaFLU-TELEVIR surveillance component by integrating Nextstrain phylogeographic and temporal analysis of several viruses, namely SARS-CoV-2, seasonal and avian influenza, MPXV and RSV. We performed minor adjustments to the original Nextstrain workflows in order to: i) change the input source so that the implemented workflow incorporates user-provided sequences (via INSaFLU or by direct user upload) instead of fetching from databases; ii) relax some of the sequence filtering (more important when fetching from external databases) to maximize the number of input sequences included in the final tree (most often consensus sequences from INSaFLU projects that already passed user-defined quality filters); and, iii) reduce input/output complexity, by removing some pathogen-specific inferences that require metadata that may not always be available (e.g., clinical onset date). For example, in the specific case of seasonal influenza, we removed fitness inferences and allowed more ambiguity and divergence among input sequences. Moreover, users may want to analyze organisms for which there is no specific Nextstrain build available. To cover these cases, we included a generic build (with or without temporal data) which performs alignment and tree-building using augur [48], either with the user-provided reference as root (when temporal data is not provided) or inferring the root and molecular clock from user-provided temporal sample metadata. The generic builds and the INSaFLU-adapted species-specific Nextstrain workflows are kept in a separate repository available at https://github.com/INSaFLU/nextstrain_builds. We regularly update our workflows with changes made to the repositories of the original workflows, e.g., with information regarding clades.
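A minimal sketch of such a generic augur build (illustrative file names; the flags follow augur's documented CLI, but this is not the INSaFLU build itself):

```python
import subprocess

def generic_build(sequences, reference, metadata=None):
    """Align, build a tree, and (optionally) infer a time tree with augur."""
    subprocess.run(["augur", "align", "--sequences", sequences,
                    "--reference-sequence", reference,
                    "--output", "aligned.fasta"], check=True)
    subprocess.run(["augur", "tree", "--alignment", "aligned.fasta",
                    "--output", "tree_raw.nwk"], check=True)
    refine = ["augur", "refine", "--tree", "tree_raw.nwk",
              "--alignment", "aligned.fasta",
              "--output-tree", "tree.nwk",
              "--output-node-data", "branch_lengths.json"]
    if metadata:  # infer root and molecular clock from sampling dates
        refine += ["--metadata", metadata, "--timetree"]
    subprocess.run(refine, check=True)
```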
algn2pheno: screening of potential genotype-phenotype associations
In the course of the evolution of any given pathogen, multiple mutations and combinations of mutations are continuously arising. Although most of these mutations have no effect or relevance, it is sometimes possible to associate certain mutations with specific phenotypes or characteristics (such as antiviral resistance, resistance to neutralizing antibodies, enhanced affinity to host receptors or enhanced transmissibility), when rich epidemiological, clinical and/or biological data are available. In this sense, in the context of viral genomics surveillance, it is crucial to be able to rapidly detect and report such known mutations of interest in genomic sequences. As such, we developed algn2pheno (https://github.com/insapathogenomics/algn2pheno), a tool that screens an amino acid or nucleotide alignment against a given "genotype-phenotype" database. algn2pheno is implemented in the routine genomic surveillance module of INSaFLU and automatically screens the SARS-CoV-2 Spike amino acid alignments of each SARS-CoV-2 project against three default “genotype-phenotype” databases: the COG-UK Antigenic mutations (https://sars2.cvr.gla.ac.uk/cog-uk/), the Pokay Database (https://github.com/nodrogluap/pokay/tree/master/data) and a database of mutations in Spike epitope residues compiled by Carabelli and colleagues [49]. algn2pheno detects all the mutations in each sequence, maps them to the three databases and generates final reports with the repertoire of mutations of interest present in each sequence and their potential linkage to specific phenotypes.
This tool is also available as a standalone command-line tool (https://github.com/insapathogenomics/algn2pheno) and was designed for flexibility and adaptation to different pathogens and customized databases. Users must provide an amino acid / nucleotide alignment including the sequences under analysis (and the reference sequence, as mutation numbering will refer to this sequence) and a “genotype-phenotype” database in either tab-separated (.TSV) or Excel (.XLSX) format. algn2pheno will generate two main outputs (among other useful intermediate files): i) a final report (tabular format, with a row per sequence) that lists all the “flagged mutations” (i.e., mutations in the database that were identified in the sequences), the phenotypes associated with those mutations and a list of all the mutations in each sequence; and, ii) a binary matrix with the mutations and the "associated" phenotypes identified in all sequences. algn2pheno is implemented in Python 3.9 and is freely available at https://github.com/insapathogenomics/algn2pheno (including usage examples and detailed output descriptions).
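The core screening logic can be sketched as follows (a schematic simplification, not the tool's actual implementation; mutation notation illustrative):

```python
def screen_alignment(sample_muts, ref_db):
    """Cross-reference per-sample mutation lists with a genotype-phenotype database.

    sample_muts: dict sample -> set of mutations (e.g., {"S:E484K", ...});
    ref_db: dict mutation -> set of associated phenotypes.
    """
    report = {}
    for sample, muts in sample_muts.items():
        flagged = {m: sorted(ref_db[m]) for m in muts if m in ref_db}
        report[sample] = {
            "flagged mutations": sorted(flagged),
            "phenotypes": sorted({p for phenos in flagged.values() for p in phenos}),
            "all mutations": sorted(muts),
        }
    return report

db = {"S:E484K": {"reduced antibody neutralization"}}
report = screen_alignment({"sampleA": {"S:E484K", "S:D614G"}}, db)
# report["sampleA"]["flagged mutations"] -> ["S:E484K"]
```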
Installation and software availability
Easier installation using docker
Although the INSaFLU-TELEVIR website is freely available for public use, it might become limiting when robust high-speed internet is not easily available (e.g., for use in the field), when high volumes of data are uploaded (subject to queue and computational constraints), or when other limitations (e.g., institutional, legal or ethical) prevent the upload of sequence and/or descriptive data to external servers. It is thus essential that the INSaFLU-TELEVIR platform also be made available locally. Although fully based on open-source software, the platform depends on many different tools, making it relatively complex to install and configure manually. To facilitate the local installation of INSaFLU-TELEVIR, we used the docker containerization system [50] to automate the installation process, making it possible for users with limited informatics knowledge to install the system. Users just need to install docker on their system, download the INSaFLU docker from GitHub (https://github.com/INSaFLU/docker) and run a small number of commands. During this process, users can also decide not to install the virus detection module (TELEVIR) if they only need the routine genomic surveillance components (INSaFLU and Nextstrain). Although designed to be installed on a computer running a Unix-like operating system, the docker installation can also be used on a Windows platform (e.g., a laptop) using the Windows Subsystem for Linux (WSL), as long as sufficient computational resources are available. The minimum recommended RAM is 32 GB if the virus detection module is installed, but can go down to 16 GB if only the genomic surveillance system is installed. One recommended use case is the installation (e.g., by a (bio)informatician) of an INSaFLU-TELEVIR instance to be shared within an institution. Also, when a local docker instance is available, findONTime (see previous sections) can be used to upload reads to the local instance and to automatically create and run viral detection projects, reducing hands-on time.
A Snakemake workflow to facilitate execution in a compute cluster
The main driver for the development of the INSaFLU-TELEVIR website was the empowerment of laboratories with limited bioinformatics capacity, facilitating the implementation and usage of bioinformatics workflows for viral metagenomic diagnostics and routine genomic surveillance. Nonetheless, it may become cumbersome to use the INSaFLU web-based interface with a very large number of samples. Also, some laboratories may want to use the analysis pipelines available in INSaFLU through internal computational infrastructures, such as compute clusters. Moreover, the website is not a practical testbed for the development of new functionalities or the integration of alternative approaches into the existing pipelines. To cater for these needs, we have implemented the functionality of the genomic surveillance module of INSaFLU as a Snakemake [51] workflow. We make use of Snakemake’s support for conda to facilitate the installation of external software, and of its SLURM support to facilitate execution on compute clusters. The workflow is available at https://github.com/INSaFLU/insaflu_snakemake, including instructions on how to use it. Using the benchmark datasets described above, the INSaFLU Snakemake workflow produced the same consensus sequences as the public website (Additional file 2).
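As an illustration of the pattern used, a hypothetical Snakemake rule combining a per-rule conda environment with an ordinary shell step (rule name, files and script are illustrative, not taken from insaflu_snakemake):

```python
# Hypothetical Snakefile fragment (Snakemake rules are a Python-based DSL)
rule consensus:
    input:
        reads="samples/{sample}.fastq.gz",
        reference="reference.fasta",
    output:
        "results/{sample}/consensus.fasta",
    conda:
        "envs/assembly.yaml"  # built automatically when run with --use-conda
    shell:
        "generate_consensus.sh {input.reads} {input.reference} {output}"
```

On a cluster, such a workflow can then be launched with, e.g., snakemake --use-conda --profile slurm, given a suitably configured SLURM profile.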