Background
The growing number of accessible genomes has prompted large-scale comparative studies aimed at discerning the evolution of infectious diseases. However, challenges such as the lack of close reference sequence(s), incompletely assembled genomes, and the large number of genomes preclude real-time multiple sequence alignment and sub-strain discovery. This paper introduces a cooperatively inspired open-source framework for intelligent mining of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes. We situate the study within the African context to advance the state of the art in intelligent infectious disease characterization and prediction. The outcome is an enriched Knowledge Base, sufficient to provide a deep understanding of the viral sub-strain identification problem. We also open an investigation by gender, which to the best of our knowledge has been ignored in related research. Data for the study were obtained from the Global Initiative on Sharing All Influenza Data database (https://gisaid.org) and processed for precise discovery of viral sub-strain transmission between and within African countries. To localize the transmission route(s) of each excavated isolate and link it to similar isolate strain(s), a cognitive solution was imposed on the genome expression patterns discovered through unsupervised self-organizing map (SOM) component planes visualization. Finally, the Friedman-Nemenyi test was performed to validate our claim.
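To make the SOM component planes idea concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes, purely for illustration, that each genome is summarized by its four nucleotide frequencies; the toy sequences, the grid size, and the function names (nucleotide_freqs, train_som, component_plane) are all hypothetical. A component plane is simply the grid of values that one input feature takes across the trained map units.

```python
import math
import random

def nucleotide_freqs(seq):
    """Hypothetical feature vector: relative frequency of A, C, G, T."""
    return [seq.count(base) / len(seq) for base in "ACGT"]

def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total
            lr = lr0 * (1.0 - frac)                 # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighbourhood
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for d in range(dim):
                        w[d] += lr * h * (x[d] - w[d])
            step += 1
    return weights

def component_plane(weights, feature):
    """Grid of one feature's value across all map units."""
    return [[unit[feature] for unit in row] for row in weights]

# Made-up toy sequences, not GISAID data.
toy_sequences = ["AAAACCGT", "AAACACGT", "GGGGTTAC", "GGGTGTAC", "ACGTACGT"]
data = [nucleotide_freqs(s) for s in toy_sequences]
weights = train_som(data, rows=3, cols=3)
plane_A = component_plane(weights, 0)  # plane for the frequency of "A"
```

Plotting each component plane as a heat map, and noting which map unit each isolate falls on, gives the kind of pattern view described above; isolates that land on the same or neighbouring units have similar feature profiles under the assumed encoding.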
Results
Evidence of inter- and intra-genome diversity was observed. While some isolates (or genomes) clustered apart, implying different evolutionary sources (high diversity), others clustered closely together, indicating a similar evolutionary source (low diversity). SOM component planes analysis revealed multiple sub-strain patterns, strongly suggesting both local (intra-community) and country-to-country transmission. Cognitive maps of both male and female isolates revealed multiple transmission routes. Statistical results indicate significant differences between the various isolate groups at the 0.05 level of significance.
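As a rough illustration of the kind of test behind the significance statement, the sketch below computes the Friedman chi-square statistic and the Nemenyi critical difference in plain Python. The scores are made-up toy numbers, not the study's data, and the function names are hypothetical. The statistic omits the tie correction, and the critical value q_alpha = 2.343 (three groups, 0.05 level) is taken from standard published tables for the Nemenyi test.

```python
import math

def average_ranks(row):
    """Rank values in a row from 1 (smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square (no tie correction) and mean rank per group.

    blocks: list of n rows, each holding one measurement per group (k groups).
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
    return stat, [r / n for r in rank_sums]

def nemenyi_critical_difference(k, n, q_alpha):
    """Two groups differ if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Toy scores: 4 blocks (rows) measured on 3 hypothetical isolate groups.
scores = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [0.9, 2.1, 3.3],
    [2.0, 1.0, 3.0],
]
stat, mean_ranks = friedman_statistic(scores)
cd = nemenyi_critical_difference(k=3, n=4, q_alpha=2.343)
print(stat, mean_ranks, cd)
```

For these toy scores the statistic is 6.5, which exceeds the chi-square critical value of 5.991 for two degrees of freedom at the 0.05 level, although the chi-square approximation is coarse for so few blocks. The post hoc step then compares pairwise differences in mean rank against the critical difference to see which groups drive the result.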
Conclusion
The proposed framework offers explanations for SARS-CoV-2 diversity and provides real-time identification of disease transmission routes, as well as rapid decision support for inter- and intra-country contact tracing of infected case(s). Intermediate data produced in this paper can enrich the genome datasets used for intelligent characterization and prediction of COVID-19 and related pandemics, and can support the construction of intelligent devices for accurate infectious disease monitoring.