Collection and genomic sequencing of 1296 Mtb isolates from Zhejiang Province
From 1998 to 2013, a total of 1434 clinical isolates were collected; of them, 1372 (95.67%) were culture-positive and 1329 (96.87%) met our predefined criteria for the sequencing purity and concentration. Thirty-three isolates that were cross-contaminated or did not represent Mtb were excluded. In total, 1296 isolates were included for our analysis (Fig. 1).
Figure 1. Clinical isolates collected in Zhejiang, 1998–2013.
Phylogenetic characteristics of the lineage 2 and lineage 4 strains
WGS data were obtained from the 1296 Mtb isolates from Zhejiang Province and downloaded for 1154 previously studied isolates that were obtained from around the world and represented the six main previously-defined phylogeographic lineages of Mtb. These data were used to construct phylogenetic trees (Fig. 2). Of the 1296 Zhejiang isolates, 964 (74.38%) belonged to lineage 2 and 332 (25.62%) belonged to lineage 4. We next selected a subset of lineage 4 clinical isolates (n = 771) from 17 countries and a subset of lineage 2 clinical isolates (n = 383) from 12 countries. To determine the placement of the Zhejiang strains along the evolutionary path of these lineages, we reconstructed maximum-likelihood phylogenies for lineages 2 and 4. The phylogenetic trees showed that lineage 2 comprises three sub-lineages, L2.1 (10.17%), L2.2 (32.57%) and L2.3 (57.26%); among them, L2.3 (552 strains) was the predominant sub-lineage in Zhejiang Province, accounting for 42.59% of the total strains. Lineage 4 was found to comprise three sub-lineages, L4.2 (18.07%), L4.4 (38.56%) and L4.5 (43.37%).
Figure 2. Bayesian phylogeny of the Zhejiang M. tuberculosis isolates and 1154 globally distributed publically available genomes for (a) lineage 2 and (b) lineage 4. Scale bar indicates the regions of origin. The M. tuberculosis sub-lineages, L2.1, L2.2, L2.3, L4.2, L4.4 and L4.5, are indicated respectively.
The distributions of sub-lineages varied between the administrative/geographic regions of Zhejiang Province (East, North, West, South and Middle). The lineage 4 types accounted for the largest proportion in Southern Zhejiang (40.10%), while Western Zhejiang had the lowest proportion (19.57%) of these lineages. Analysis of spatial-temporal trends in the distributions of lineage 2 and 4 isolates among the five districts indicated that the proportion of lineage 4 isolates decreased in Northern and Southern Zhejiang over the 16-year study period, whereas it increased in Western Zhejiang (Supplementary Fig. S1).
Phylogeographic evolution of the major sub-lineages
Published phylogeographic studies have indicated an African origin for Mtb, suggesting that it was introduced to other continents via human migration [9, 19]. To further explore the evolutionary relationship of these strains and their geographical distribution, we used Bayesian evolutionary analysis (Table 1, Fig. 3) to predict the divergence time of the most recent common ancestors of four sub-lineages (Supplementary Fig. S2).
Table 1
The most recent common ancestors of L2 and L4 sub-lineages in China
Summary statistics
|
L2.2
|
L4.2
|
L4.4
|
L4.5
|
Mean (tMRCA)
|
10,763
|
8,530
|
7,800
|
7,446
|
SE of the mean
|
39.5
|
62.7
|
39.4
|
43.0
|
Median (tMRCA)
|
10,740
|
8,499
|
7,770
|
7,435
|
Geometric mean
|
10,711
|
8,456
|
7,747
|
7,406
|
95% HDI
|
[8,729 − 12,836]
|
[6,378 − 10,804]
|
[6,064 − 9,572]
|
[5,900-8,901]
|
ESS
|
711.5
|
323.7
|
531.5
|
319.1
|
tMRCA: the most recent common ancestor; SE of the mean: standard error of the mean tMRCA; HDI: highest posterior density interval; ESS: effective sample size. |
Figure 3. Mutation rates and changes in sub-lineage diversity over time. (a) The mutation rate was estimated using Beast. (b) Bayesian skyline plots indicating changes in the diversity of four sub-lineages over time. Shadowed areas show the 95% HPD (high posterior density) intervals for the population-size estimations.
Our results revealed that L2.2 is the most ancient of the studied sub-lineages in China, with its tMRCA appearing around 10,763 years ago (95% HDI: 8729-12,836 years ago), whereas L4.5 is the most modern of the studied sub-lineages in China, with its tMRCA appearing around 7446 years ago (95% HDI: 5900–8901). As shown in Fig. 3a, the substitution rate of Mtb was found to be a mean of 4.35 × 10− 9 substitutions per genome per site per year [95% HPD interval: 3.58 × 10− 9-5.26 × 10− 9; converted by the calculated annual mutation rate of each polymorphic locus (24,633 loci): ucld.mean = 1.49 × 10− 6].
Given the times of origin for the four sub-lineages in China, the characteristics of the MCC tree (Supplementary Fig. S2), and historical information on the arrival and spread of modern humans in China [19], we propose two possible routes of propagation across China for each of the studied sub-lineages (Fig. 4). For L2.2, one potential route of propagation originates in Xinjiang in Northwest China and spreads to the South and Southeast, while the other originates in Fujian and spreads to the north. For L4.2, one potential route of propagation originates in Qinghai Province in Western China and spreads to the East and Southeast, while the other originates in Heilongjiang Province in Northeast China and spreads to the South. For L4.4, one possible route of propagation originates in Guangdong and Hunan Provinces of Southern China and spreads to the North, while the other originates in Heilongjiang Province and spreads to the South. For L4.5, one possible route of propagation originates in Xinjiang Province and spreads to the East and Southeast, while the other originates in Heilongjiang Province and spreads to the South and Southwest. The origin times of some key propagation points are shown in Fig. 4.
Figure 4. Potential propagation routes of four sub-lineages in China. Shown are routes for L2.2 (a), L4.2 (b), L4.4 (c) and L4.5 (d). The dotted line indicates that the distance is long and the evidence maybe weak (possibly due to a lack of strains). Blue lines indicate older transmission routes, while orange lines indicate more recent transmission routes.
We used a similar method to obtain the divergence times for the MRCAs of the six sub-lineages found in Zhejiang Province. As shown in Table 2, we found that L2.2 is the most ancient of the studied sub-lineages in Zhejiang, with its MRCA appearing around 6 897 years ago (95% HDI: 6513–7298 years), while L4.4 is the most modern of the studied sub-lineages in Zhejiang, with its MRCA appearing around 2217 years ago (95% HDI: 1864–2581 years).
Table 2
The most recent common ancestor of L2 and L4 in Zhejiang
Summary statistics
|
L2.1
|
L2.2
|
L2.3
|
L4.2
|
L4.4
|
L4.5
|
Mean (tMRCA)
|
5,602
|
6,897
|
5,712
|
3,604
|
2,217
|
4,272
|
SE of the mean
|
14.6
|
4.6
|
13.4
|
13.2
|
10.7
|
6.9
|
Median (tMRCA)
|
5,679
|
6,898
|
5,815
|
3,603
|
2,214
|
4,273
|
Geometric mean
|
5,514
|
6,894
|
5,623
|
3,599
|
2,210
|
4,267
|
95% HDI
|
[5,077 − 6,123]
|
[6,513-7,298]
|
[5,202-6,229]
|
[3,220-4,012]
|
[1,864-2,581]
|
[3,841-4,670]
|
ESS
|
207.5
|
1894.6
|
229.8
|
238.9
|
291.6
|
958.1
|
tMRCA: the most recent common ancestor; |
SE of the mean: standard error of the mean tMRCA; |
HDI: highest posterior density interval; ESS: effective sample size. |
Given the origin times of the six sub-lineages in Zhejiang, the characteristics of the MCC tree (Supplementary Fig. S3) and the above-described possible transmission routes of the four sub-lineages in China, we inferred the potential propagation routes for the six sub-lineages in Zhejiang, as shown in Fig. 5. The directions and estimated years at which the strains entered Zhejiang from other regions are basically consistent with the transmission routes of the four sub-lineages (L2.2, L4.2, L4.4 and L4.5) in China. More specifically, L2.1 and L2.3, which derived from 5,700 years ago, might be related to the origin and migration of Liangzhu Culture (about 5,500 years ago), sharing similar original time and geographical distribution [20]. L4.5, deriving from 3,600 years ago, might be related to the Battle of Mingtiao, which was the final battle of the Xia Dynasty (circa 1,600 BC). Shang Tang won the battle and Xia Jie retreated to Nanchao, adjacent to Zhejiang Province [21]. L4.4, deriving from 2,200 years ago, might be related to the war of Qin State destroying Chu State (circa 200 BC). At that time, the territory of Chu included western and southeastern Henan, southern Shandong, Hubei, Hunan, Jiangxi, Anhui, Jiangsu, and Zhejiang. The marching route of Qin destroying Chu was consistent with the transmission route of L4.4 [22]. Moreover, the spread of L2.2 might be related to the origin of Zhejiang's agricultural civilization and the transmission route of L4.5 began from sea, which may be related to the origin of the Maritime Silk Road [23].
Figure 5. Potential propagation routes of six sub-lineages in Zhejiang Province. Shown are L2.1 (a), L2.2 (b), L2.3 (c), L4.2 (d), L4.4 (e) and L4.5 (f). The curves without starting points indicate the directions and years of the strains entering Zhejiang Province from other regions.
Genomic features of lineages 2 and 4
We compared the genetic diversity of the lineage 2 and 4 strains in Zhejiang Province to that of the global strains. As seen in the global strains, there was greater genetic diversity among the lineage 4 strains from Zhejiang Province than among the lineage 2 strains (Fig. 6). Zhejiang lineage 4 strains harbored a mean diversity of 565 SNPs between any two strains, compared to 291 SNPs in lineage 2.
Figure 6. Number of pairwise differences between Mtb strains for lineage 2 and lineage 4. The alignment of 217 human-adapted Mtb clinical strains published previously (Comas et al., 2013) was used to calculate pairwise differences of global strains.
Figure 7. Box plots of pairwise genetic distances (number of polymorphisms) for each sub-lineage.
Our estimation of the genetic diversity among the sub-lineages of lineages 2 and 4 based on the SNP pairwise distances showed that L2.3, the predominant sub-lineage in lineage 2, was significantly more conserved than L2.1 (mean of 202 and 337 SNPs, respectively, shared between isolate pairs; Wilcoxon rank-sum test, P < 0.0001). In lineage 4, we observed the opposite trend, as the predominant sub-lineage, L4.5, was more diverse than L4.2 (mean of 385 and 253 SNPs, respectively; Wilcoxon rank-sum test, P < 0.0001) (Fig. 7).
To assess the genetic diversity of antigens in the lineage 2 and 4 strains, we calculated the non-synonymous to synonymous substitution (dN/dS) ratios for the epitope and non-epitope regions, along with the distribution of amino acid replacements in individual epitopes. We found that the dN/dS ratio of epitope and non-epitope regions exhibited significantly more conservation in lineage 2 strains than in lineage 4 strains. In lineage 2 strains, however, the T cell epitope regions showed significantly higher dN/dS ratios than the non-epitope regions (Fig. 8). When we assessed the evolutionary conservation of human T cell epitopes in the sub-lineages of lineage 2 and lineage 4 (Supplementary Fig. S4), we found that the dN/dS ratios for the sub-lineages of lineage 4 were similar to those of the overall lineage. For the sub-lineages of lineage 2, meanwhile, the dN/dS ratio of the lowest-prevalence sub-lineage, L2.1, differed from that of the overall lineage, whereas the ratios of the other sub-lineages were consistent with those of the overall lineage.
Figure 8. Pairwise ratios for the rates of nonsynonymous to synonymous substitutions (dN/dS) in lineage 2 and 4 isolates, assessing epitope and non-epitope regions of T cell antigens.
When we analyzed the distribution of amino acid replacements in individual epitopes, we found that a large majority (95%) of the 491 T cell epitopes showed no amino acid change (Supplementary Fig. S5). However, lineage 2 had more epitopes that harbored at least one amino acid change, compared to lineage 4. In lineage 2, four epitopes (esxL, lpqH, fbpB and lppX) harbored more than two variable positions.