Reproducing a large-scale study on genetic clustering and linkage to antibiotic resistance data in Neisseria gonorrhoeae
Our team recently performed an extensive genomic analysis of the bacterial pathogen Neisseria gonorrhoeae (Pinto et al., 2021). In that study, 3,791 N. gonorrhoeae genomes from isolates collected across Europe were analyzed with a cgMLST approach. Genetic clusters were determined with the goeBURST algorithm implemented in PHYLOViZ (Francisco et al., 2009, 2012; Nascimento et al., 2017; Ribeiro-Gonçalves et al., 2016) for all possible allelic distance thresholds (partitions). Cluster concordance between consecutive distance thresholds was assessed with the nAWC in order to determine regions of cluster stability (Barker et al., 2018; Carriço et al., 2006; Llarena et al., 2018; Severiano et al., 2011), which were used for nomenclature purposes and for the identification of genogroups. Metadata were then associated with the genetic clusters through time-consuming table handling in a spreadsheet program. This workflow was not automated and, particularly for the cluster congruence analysis and the integration of genetic data with clinically or epidemiologically relevant data, it was a highly demanding process that is difficult to apply in real-time pathogen surveillance. Therefore, to validate ReporTree and demonstrate how it can enhance the surveillance of and research on bacterial pathogens, we used the same dataset as in the previous study (Pinto et al., 2021) and attempted to reproduce the main study outputs with this tool. As shown at https://github.com/insapathogenomics/ReporTree/wiki/, using as input the allele matrix with 822 loci (available at https://zenodo.org/record/3946223#.YhTKQy8qKrw) and the associated metadata (available in Supplementary material 1 of Pinto et al., 2021), ReporTree automatically identified the genetic clusters at all possible partition thresholds of the generated MST and identified the same regions of cluster stability as Pinto et al. (2021).
Moreover, it provided an updated metadata table with clustering information at the first partition of each stability region, which could be used as input for visualization in GrapeTree (Zhou et al., 2018). Furthermore, it generated summary reports with statistics/trends for each genetic cluster at the lower and higher levels of stability (i.e., 40 allelic differences at the lower level and 79 allelic differences at the higher level, similar to what was found by Pinto et al.). Finally, ReporTree was able to associate and report the distribution of genetic determinants of antimicrobial resistance in N. gonorrhoeae across the different genetic clusters. Importantly, this example allowed a clear validation of the tool by rigorously reproducing the data presented, for example, in Figs. 1a, 1b and 3 and in Tables 1 and 2 of Pinto et al. (2021). All these outputs (and additional ones) are available for consultation at https://github.com/insapathogenomics/ReporTree/tree/main/examples/Neisseria. Notably, this proof of concept was run with a single command line that took approximately 1 min 39 s on a laptop [Intel Core i5(R)] with 16 GB of RAM.
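To make the notion of a stability region concrete, the following minimal Python sketch scores the concordance between consecutive partitions with the Adjusted Wallace coefficient and reports runs of thresholds whose scores stay high. It is an illustration under stated assumptions, not ReporTree's implementation: the function names, the 0.99 cutoff, the minimum run length, and the direction of the comparison (from the partition at threshold n+1 to the partition at threshold n) are choices made here for the example.

```python
from collections import Counter
from math import comb


def adjusted_wallace(a_labels, b_labels):
    """Adjusted Wallace coefficient AW(A -> B) for two partitions of the same samples."""
    n = len(a_labels)
    if n != len(b_labels) or n < 2:
        raise ValueError("both partitions must label the same samples (at least two)")
    # Number of sample pairs placed in the same cluster by A, by B, and by both.
    pairs_a = sum(comb(c, 2) for c in Counter(a_labels).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b_labels).values())
    pairs_ab = sum(comb(c, 2) for c in Counter(zip(a_labels, b_labels)).values())
    if pairs_a == 0:
        return None  # undefined when A contains only singletons
    wallace = pairs_ab / pairs_a
    expected = pairs_b / comb(n, 2)  # Wallace expected under independence
    if expected == 1:
        return None  # undefined when B is a single cluster
    return (wallace - expected) / (1 - expected)


def stability_regions(partitions, cutoff=0.99, min_len=2):
    """Score consecutive partitions and return runs of concordant thresholds.

    `partitions` is a list of label lists ordered by increasing threshold.
    `cutoff` and `min_len` are illustrative assumptions.
    Returns (scores, regions); each region is (first_index, last_index).
    """
    scores = [
        adjusted_wallace(partitions[i + 1], partitions[i])
        for i in range(len(partitions) - 1)
    ]
    regions, start = [], None
    for i, score in enumerate(scores + [None]):  # trailing None closes an open run
        stable = score is not None and score >= cutoff
        if stable and start is None:
            start = i
        elif not stable and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return scores, regions
```

With hierarchical partitions, every pair clustered at threshold n is also clustered at n+1, so the comparison is only informative in the n+1 to n direction, which is why that direction is assumed here.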
ReporTree and its application to genomics-informed routine surveillance (e.g., SARS-CoV-2) and outbreak detection (e.g., Listeria monocytogenes)
Genomics-informed surveillance of SARS-CoV-2 has had an important role in worldwide public health and political decision-making in the last two years. In Portugal, weekly reports of nationwide sequencing surveys are provided to public health authorities and the general public, describing important indicators and trends of the evolution and geotemporal spread of the virus (https://insaflu.insa.pt/covid19/). Therefore, after validating ReporTree, we implemented this tool in the routine genomic surveillance of SARS-CoV-2 in the country, with the objective of speeding up both the association between genomic and epidemiological data and the generation of surveillance-oriented reports. For instance, besides its extensive use for monitoring the relative frequency of variants of concern (VOCs) at regional and national levels, ReporTree is applied to identify clusters of very closely related viruses (e.g., using TreeCluster (Balaban et al., 2019) max-clade or avg-clade models at high resolution levels) that may represent local transmission networks or even super-spreading events. An example of ReporTree application in the context of SARS-CoV-2 genomic epidemiology is provided at https://github.com/insapathogenomics/ReporTree/wiki/.
ReporTree can be applied to a broad spectrum of species. One of the most direct and intuitive applications is the analysis of cg/wgMLST data for outbreak investigation, namely for foodborne bacterial pathogens, as this subtyping method delivers sufficiently high resolution and epidemiological concordance (Nadon et al., 2017). The ReporTree wiki (https://github.com/insapathogenomics/ReporTree/wiki/) provides a simple simulated example in which, with a single command line, ReporTree builds an MST from cgMLST data and automatically extracts and reports genetic clusters of Listeria monocytogenes at high resolution levels commonly used for outbreak detection (< 5 and < 8 allelic differences; Van Walle et al., 2018), as routinely performed in Portugal.
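The idea of extracting clusters at a given allelic distance threshold can be sketched in a few lines of Python. The example below is not ReporTree's code: it uses single-linkage grouping as a stand-in, on the assumption that cutting a true minimum spanning tree at a threshold leaves the same groups. The function names, the toy profiles, and the convention that "0" marks a missing allele are all hypothetical.

```python
from itertools import combinations


def allelic_distance(p, q, missing="0"):
    """Count loci whose alleles differ, skipping loci missing in either profile."""
    return sum(1 for a, b in zip(p, q) if a != b and missing not in (a, b))


def clusters_at_threshold(profiles, max_diff):
    """Group samples connected by a chain of pairs differing at <= max_diff loci.

    `profiles` maps a sample name to its list of allele identifiers.
    """
    parent = {sample: sample for sample in profiles}

    def find(sample):
        # Union-find lookup with path halving.
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for s, t in combinations(profiles, 2):
        if allelic_distance(profiles[s], profiles[t]) <= max_diff:
            parent[find(s)] = find(t)

    groups = {}
    for sample in profiles:
        groups.setdefault(find(sample), []).append(sample)
    return sorted(sorted(group) for group in groups.values())


# Toy profiles with six loci (real cgMLST schemas have far more).
toy_profiles = {
    "s1": ["1", "1", "1", "1", "1", "1"],
    "s2": ["1", "1", "1", "1", "1", "2"],
    "s3": ["1", "1", "2", "2", "2", "2"],
    "s4": ["9", "9", "9", "9", "9", "9"],
}
print(clusters_at_threshold(toy_profiles, max_diff=1))
print(clusters_at_threshold(toy_profiles, max_diff=3))
```

Under this convention, a "< 5 allelic differences" threshold would correspond to `max_diff=4` and "< 8" to `max_diff=7`. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.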
In both examples, ReporTree proves a useful asset by rapidly generating summary statistics/trends for key surveillance metrics, such as the relative frequency of the different (sub-)lineages/clusters circulating in the country over time. Notably, ReporTree was designed to provide a high degree of flexibility, allowing, for example, the rapid production of summary and count/frequency reports. Summary reports include the distribution of any number of user-specified variables of interest (e.g., vaccination status, source, country, timespan, etc.) over any user-specified grouping variable (e.g., lineage/cluster or age), whereas count/relative frequency reports include the distribution of one grouping variable over any other user-specified variable (e.g., lineage/cluster per week, or clade per country). When the time variable "date" is provided in the metadata, ReporTree automatically infers other time units (ISO week and ISO year) and metrics (e.g., cluster timespan) relevant for surveillance purposes. Furthermore, ReporTree allows filters to be applied to the metadata table to select the samples that will be included in the report, without the need to generate a new metadata table.
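As an illustration of the time-derived metrics and frequency reports described above, the following Python sketch derives the ISO year and ISO week from a "date" column, computes each cluster's timespan, and tabulates the relative frequency of clusters per ISO week. It relies only on the standard library and is not ReporTree's implementation; the function names, the "cluster" column name, and the "YYYY-Www" label format are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date


def iso_week_label(iso_date):
    """Return 'YYYY-Www' built from the ISO year and ISO week of a YYYY-MM-DD string."""
    iso_year, iso_week, _ = date.fromisoformat(iso_date).isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def cluster_timespans(rows):
    """Days between the earliest and latest sample date in each cluster."""
    dates = defaultdict(list)
    for row in rows:
        dates[row["cluster"]].append(date.fromisoformat(row["date"]))
    return {cluster: (max(d) - min(d)).days for cluster, d in dates.items()}


def weekly_frequencies(rows):
    """Relative frequency of each cluster within each ISO week."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[iso_week_label(row["date"])][row["cluster"]] += 1
    return {
        week: {cluster: n / sum(per_week.values()) for cluster, n in per_week.items()}
        for week, per_week in counts.items()
    }


# Hypothetical metadata rows; a real table would be read from a file.
metadata = [
    {"sample": "a", "date": "2021-01-04", "cluster": "c1"},
    {"sample": "b", "date": "2021-01-06", "cluster": "c1"},
    {"sample": "c", "date": "2021-01-07", "cluster": "c2"},
    {"sample": "d", "date": "2021-01-12", "cluster": "c1"},
]
print(cluster_timespans(metadata))
print(weekly_frequencies(metadata))
```

The ISO year can differ from the calendar year near the turn of the year (3 January 2021 falls in ISO week 53 of 2020), which is why the week label is built from the ISO year rather than from the year in the date string.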