Assessing completeness of assembled genomes was done with Quast by mapping them against the H37Rv reference genome. Lineage 6 genomes had ≥ 98.069% alignment, lineage 4 had ≥ 96.145% and lineage 2 had ≥ 96.856% alignment with the reference genome. A complete report on genome assembly statistics is shown in Supplementary table 1. In order to avoid sub-population bias, a transmission analysis was performed by constructing a phylogenetic tree using core genomes rooted against the H37Rv reference genome in each lineages in this study as seen in Fig. 1.
Mycobacterium africanum (Lineage 6):
The identification of core genes (genes shared between all strains), accessory genes (genes shared by two or more strains but not all) and unique genes (genes peculiar to individual strains) were clustered respectively. The pan-genome analysis across 22 genomes of M. africanum showed that they have 3974 ± 13 genes (mean ± standard deviation). The number of accessory genes ranged from 115 to 159 genes and the unique set of genes varied from 1 to 54 genes (Supplementary table 2A). Empirical power law equations and exponential equations were used to generate pan- and core genome curves, which showed that the pan-genome curve has almost reached a plateau as the exponent parameter calculated is 0.03 (Fig. 2). This argues that the 22 genomes analysed are sufficient to obtain an accurate estimate of pan- and core genome size, with additional samples yielding diminishing returns.
Pan-genome functional analysis was done using KEGG pathway database and KEGG IDs assigned to orthologous protein clusters in the core, accessory and singleton genes and matched against the database, as shown in Fig. 3. The highest gene contents in the core genomes of M. africanum were responsible for biological processes such as metabolism, whilst a remarkable amount of unique genes account for environmental information processing and organismal systems. A more detailed KEGG classification showed that majority of unique gene sets were responsible for signalling molecules and interaction, signal transduction, infectious diseases, digestive system and cellular community (Fig. 4).
Mycobacterium tuberculosis lineage 4 (Euro-American lineage):
The classification of core, accessory and unique genes in numbers after clustering into orthologous groups was performed. The 21 genomes from the Euro-American lineage had 4067 ± 18 genes (mean ± standard deviation). Accessory gene numbers ranged from 242 to 289 and singleton gene numbers were from 19 to 55 genes (Supplementary table 2B). Power law equations and exponential equations were used to generate pan and core genome curves which reflects the curve has almost reached a plateau as the exponent parameter calculated is 0.06 (Fig. 2).
A high percentage of genes in the core genome were linked to metabolic processes after functional classification of the pan-genome (Fig. 3). A more detailed KEGG classification of the pan genome showed that majority of the accessory genes are related to infectious diseases and cellular community (Fig. 4).
Mycobacterium tuberculosis lineage 2 (Beijing lineage):
Orthologous clustering of the pan-genome showed that the analysis of 24 genomes have 4074 ± 29 genes (mean ± standard deviation). The number of accessory genes ranged from 216 to 262 genes, with 3825 genes shared between all strains (core genes) and unique gene sets varied from 1 to 121 genes per strain (Supplementary table 2C). Exponential regression was used to generate the number of core genes whilst power law regression was used to extrapolate the pan-genome curve which reflects a slightly closed pan-genome with an exponent parameter 0.03 (Fig. 2).
KEGG classification of core, accessory and unique genes into functional roles by clustering genes into orthologous groups argued that the majority of core, accessory and unique genes are responsible for metabolic functions (Fig. 3) and a detailed classification of functional genes showed that a relative quantity of the core genes are associated with carbohydrate metabolism, whilst the accessory genes are linked with cellular community, amino acid metabolism and infectious diseases (Fig. 4).
Pan-genome comparative analysis
Following the clustering of respective TB datasets into core, accessory and unique genes, the core genes of lineage 6 were higher in number than core genes in lineages 4 and 2 but had fewer accessory genes than these lineages (Supplementary table 2). The red, orange and cyan lines in Fig. 2 (a, b and c) represent the power-fit curve derived from the equation (f(x) = a.x^b), where the exponent b > 0 implying that the genome is open, however the parameter b values are 0.03, 0.06 and 0.03 (Lineage 6, Lineage 4 and Lineage 2 respectively) meaning the pan genome is almost closed and addition of new genomes may not lead to the discovery of novel functions (30).
Using BPGA to cluster the core, accessory and unique genes and assigning KEGG functional pathways to them, about 92% of the core genes of the three lineages are responsible for metabolism related pathways. However, accessory gene assignment of metabolic, organismal system and environmental information related pathways in the lineages are: Lineage 6 (33%, 59% and 58%), Lineage 4 (23%, 28% and 33%) and Lineage 2 (47%, 18% and 21%) as shown in Fig. 4.
Evolution of M. africanum
Core genes are responsible for survival and majority of biological processes in the bacteria. Those that exist in copies, also known as “multi copy core genes” (MCG) were studied as these genes, especially rRNAs in bacteria, have been seen to influence adaptation to environmental pressures and the structure of microbiomes (31). We examined conserved genes of these lineages that exhibit copy number variation (CNV) and constructed a phylogeny based on their core genomes (Fig. 5). Eleven (11) PE/PPE family proteins and four (4) fadD genes fall under the category of MCG. PPE family proteins and fadD genes were selected as they have been reported to be crucial factors for mycobacterial virulence and in vivo pathogenicity (32–34).
PPE family proteins 15 and 26 all had similar gene copy number across all strains, which shows that these proteins have remained unchanged over time, whilst family PPE proteins 20, 47/48, 57, 51 and 42 showed little evolution in the CNV and lastly, PPE proteins 40, 32, 29 and 33 displayed lineages distinct CNV across the genomes (Fig. 5).
On the other set of gene families studied, fadD13, fadD15, fadD25 and fadD32 genes which are responsible for fatty acid CoA ligase, all showed CNV in the TB strains during the course of evolution except the fadD32 gene (Fig. 5). Additionally, multiple sequence alignment of fadD13 genes in MTBC genomes showed substitution mutations A8G in fadD13_2, S108F and A123V in fadD13_3 (Fig. 6).