Genome sequence processing
The genome sequences of SARS-CoV-2 were downloaded from the Global Initiative of Sharing All Influenza Data (GISAID, https://www.gisaid.org/) on January 25, 2021. The dataset contains 279,411 samples which were sampled from December 24, 2019 to January 21, 2021. Firstly, we used the data collected in the first three months (i.e., December 24, 2019 to March 22, 2020) for reconstructing the early transmission network of SARS-CoV-2 (including 19,187 samples, covering 95 countries or regions). Secondly, to repeat and validate our results, we used data covering the first 6 months (including 84,243 samples, covering 112 countries or regions). These samples represent the early detected samples in the world covering many countries. Because the results are similar, thus, we mainly reported the results of the first 3 months, but discussed the results of first 6 months. The genome sequence of BANAL-52 (GISAID accession number: EPI_ISL_4302644) was downloaded from GISAID and used as a reference for identifying mutations of SARS-CoV-2. We aligned these genome sequences to the reference sequence (Wuhan-Hu-1, GISAID: EPI_ISL_402125) using Muscle-V5. To minimize the potential impacts of sequencing errors, nucleotides at the 5' UTR (sites 1‒265) and 3' UTR (sites 29675‒29903) were excluded.
Reconstruction of the transmission network
We clustered all samples based on sequence differences. Samples with the same genome sequence were assigned into a node of transmission chains. Each node in the transmission chains represents a unique sequence with distinct mutation sites, which is often composed of multiple samples from the same or different places. We reconstructed the transmission network of SARS-CoV-2 based on differences of BANAL-52-referenced mutations (i.e., BANAL-52-ref mutations) in all sites of each node. The transmission network is composed of transmission chains. The core process of reconstructing each transmission chain is to find the closest ancestor node of each node according to the following mutation model (Fig. 2A).
We defined mutations that occurred after the emergence of the most recent common ancestor (MRCA) of SARS-CoV-2 as the de novo mutations. Since it is impossible to directly determine the de novo mutations because the common ancestor is unknown, we inferred the mutations by using BANAL-52 as a reference (for details, see Reference Selection Section). We assumed that sequence S0 is the MRCA of the other sequences (i.e., S1, S2, S3.1, S3.2, S4.1, S4.2) in Fig. 2. We used the following steps to reconstruct a transmission chain or network of SARS-CoV-2 based on its ancestry relationship (ancestor-offspring relationship) of de novo mutations (Fig. 2):
(1) For each node, we identified its ancestor nodes. We performed a pairwise comparison of the sequences. The de novo mutations contained in a node X is set MX = {m1, m2, ..., mn}, then, for node U and V, if MU ∈ MV, then U is an ancestor node of V. The offspring node should have more mutations than its ancestor node, that is, the mutations of the offspring node must include all the mutations of its ancestor node. For example, in Fig. 2A, S0, S1, S2, S3.1 are the ancestor nodes of S4.1, and S0, S1, S2, S3.2 are the ancestor nodes of S4.2.
(2) For each node (i.e., a unique sequence), we identified its closest ancestral node from all ancestor nodes. For all ancestor nodes identified in step (1), we selected the node which has the closest similarity of mutations to the focal node as its closest ancestral node. For example, in Fig. 2A, S2 is the closest ancestral node of S3.1 and S3.2; S3.1 is the closest ancestral node of S4.1; S3.2 is the closest ancestral node of S4.2.
(3) We connected all nodes with their closest ancestral nodes to form a transmission chain (Fig. 2B) or network (Fig. 2C). A transmission network indicates several transmission chains which share a common root node.
By following the above procedure (1), (2), and (3), ancestral nodes of all nodes were determined, and the transmission chains were finally reconstructed (Fig. 2). Because some chains would share a common node (Fig. 2B), thus, we reconstructed the transmission network by merging chains sharing the common node (Fig. 2C).
The sample clustering (i.e., identification of node) and reconstruction process of transmission chains and the networks were implemented by custom scripts in Python-3.7.
Reference selection
To determine the de novo mutations of a node, we need to select a reference sequence. Because the most recent common ancestor of SARS-CoV-2 (i.e., MCRA in Fig. 2A) is still unknown, we have to choose a detected sequence that is close to SARS-CoV-2 as the reference sequence.
As shown in Fig. 3, if we use a sequence of SARS-CoV-2 which is an offspring of SARS-CoV-2, the mutations inferred from the reference may be incorrect. For example, according to Fig. 3, using the de novo mutations, the correct transmission chain based on the ancestor-offspring relationship defined above should be L1: S0→S1→S2→S3→S4→S5→S6. But if we select an offspring as the reference (e.g., S3), the identification of S3 inferred mutation (S3-ref mutation) sites may be incorrect for some sequences (i.e., cyan A in L2), resulting in two short chains (L2, L3). The transmission direction or ancestor-offspring relationship after S3 is still correct (i.e., L3), but incorrect for sequences before S3 (i.e., L2). Some S3-ref mutations (cyan T) in L3 are correct de novo mutations, while other S3-ref mutations (cyan A) are not.
Alternatively, we selected an earlier but the closest relative sequence (i.e., BANAL-52) to SARS-CoV-2 as the reference to infer the mutations in SARS-CoV-2 (i.e., BANAL-52-ref mutations), and assessed its potential in reconstruction of transmission chains. BANAL-52 was sampled in 2020, it is a closest relative to SARS-CoV-2 with an about 3.2% difference with SARS-CoV-2 in the whole genome [6]. As shown in Fig. 4, we assume that, as compared to the MRCA between SARS-CoV-2 and BANAL-52 (i.e., MRCA-S-R), BANAL-52 has a mutation (orange T) at site 10 (from left to right), ancestor of MRCA of SARS-CoV-2 has a mutation at site 8 (orange C), MRCA of SARS-CoV-2 (i.e. MRCA-S) has two mutations (orange C,G) at site 8 & 9; as compared to MRCA-S, SARS-CoV-2 has 1–6 de novo mutations (boxed red T) at site 1–6, and two inherited ancestor mutations (orange C,G) from MRCA-S and ancestor of MRCA-S at site 8 & 9. If we use the MRCA-S as the reference, the transmission chain of SARS-CoV-2 should be reconstructed as L0: S1→S2→S3→S4→S5→S6. If we use BANAL-52 as the reference, the de novo mutations would be identified correctly in L1 (i.e., boxed cyan T at site from 1–6) which are same to the red and boxed T in L0. The ancestor mutations of SARS-CoV-2 in L1 (cyan C, G) would be correctly identified. But mutation of BANAL-52 (orange T) would be identified as an incorrect mutation in L1 (cyan A) because the mutation of the BANAL-52 reference results in difference of base at site 10 between BANAL-52 and SARS-CoV-2. Because both inherited ancestor mutations of SARS-CoV-2 (site 8,9) and incorrectly inferred mutation using BANAL-52 (site 10, named as incorrect mutation) would be carried by all samples of SARS-CoV-2, they are not considered in the reconstruction of the transmission chain based our method if no secondary mutation occurs on these sites (see below). Using the BANAL-52-ref mutations, the transmission chain can be reconstructed as L1: S1→S2→S3→S4→S5→S6, which is same to S0 (Fig. 4). This lays the solid foundation of reconstructing the transmission chains or networks using BANAL-52 as the reference. In Fig. 4, we introduced mutations of C, G to illustrate ancestor mutation of SARS-CoV-2.
Model errors
Because BANAL-52-ref mutations are used to reconstruct the transmission chains or networks, secondary mutation at the BANAL-52-ref mutation sites in L1 (Fig. 4) would cause errors in reconstruction of transmission chains and networks. If the secondary mutation changes the base back into its original base, it is called a back mutation. In our model, there are three kinds of BANAL-52-ref mutations in L1: de novo mutations (cyan T), ancestor mutations (cyan C, G) and incorrect mutations (cyan A) which were used for reconstructing the transmission chains or networks (Fig. 4). If there is no secondary mutation occurring on these mutation sites of SARS-CoV-2 during the study period, the reconstructed transmission chain (L1) is same to the true one (L0) (Fig. 4); otherwise, it will cause errors in reconstructing the transmission chains or networks (see Fig. 5).
As shown in Fig. 5, a secondary mutation on the BANAL-52-ref mutation sites during the study period of three months would cause biases in the reconstruction of transmission chains. If no secondary mutation occurs, the original transmission chain should be: L0 = S1→S2→S3→S4→S5→S6. If the secondary mutation (T→A, here, it is a back mutation) occurs in the de novo mutation sites of a sequence (S4), and the sequence has no extra copies, the mutated sequence (S4b in Fig. 5B2) will become an isolated chain (L1.2n = S4b, Fig. 5B2), while the relationship of other samples on the chain remains unchanged except for the absence of S4, i.e., L1.1n = S1→S2→S3→S5→S6 (Fig. 5B2). However, if S4 has extra copies, it will not affect the original chain: L1.1c = S1→S2→S3→S4→S5→S6, while the secondary mutation would produce an isolated chain: L1.2c = S4b (Fig. 5B1). If the secondary mutation occurs in the de novo mutation sites of a sequence (S4), and the mutated sequence (S4b) is the same as its ancestor sequence (S3), this will cause no change of original chain if S4 has extra copies (i.e., L2c = S1→S2→S3→S4→S5→S6) or it would produce a shorter chain with the absence of S4 (i.e., L2n = S1→S2→S3→S5→S6) (Fig. 5C). If the secondary mutation occurs in the site corresponding to the BANAL-52 mutation site (cyan A →purple T, here, it is not a back mutation because cyan A is an incorrect mutation) and the sequence (S4) has extra copies, the mutated sequence (S4b) would become an independent chain: L3.2c = S4b, while the relationship of the other samples on the chain remains unchanged: L3.1c = S1→S2→S3→S4→S5→S6 (Fig. 5D1). However, if the sequence (S4) has no extra copies, it would break the original chain into two chains with the same probability: L3.1n = S1→S2→S3→S5→S6, L3.2n = S4b→S5→S6 (Fig. 5D2). Secondary mutation in the ancestor mutation sites (i.e., C, G in Fig. 4) would have similar results to cyan A (for simplification, we did not show these sites in Fig. 5). Because the proportion of sequence with two or more secondary mutations was extremely low in this study, thus, we only presented illustrations of one secondary mutation in Fig. 5.
Considering the model error which could alter the ancestral-offspring of about 20% sequences of the first 3 months (higher for sequences of first 6 months) (For details, see below), we emphasized on analysis and discussion of general transmission patterns (such as number of chains, networks ect.) of SARS-CoV-2, not on the specific sequences.
Error probability caused by secondary mutations
If there is no secondary mutation during the study period of three months, the transmission chain (L1) is same to the true one (L0) (Fig. 4). However, if secondary mutation occurs on these sites of BANAL-52-ref mutations, error of reconstructed transmission chain would arise. The BANAL-52-ref mutations are composed of de novo mutations (cyan T), ancestor mutations (cyan C, G) and incorrect mutations (cyan A) in L1 (Fig. 4).
In theory, the probability of secondary mutation on de novo mutations (pd) is determined by the mutation probability of the de novo mutations of SARS-CoV-2 within three months. The mutation rate was assumed as 6 × 10− 4 substitutions per site annually[12], and the mutation probability of SARS-CoV-2 within 3 months was calculated as: PT = 6 × 10− 4 / 4 = 1.5 × 10− 4. The proportion of de novo mutation within three months can be assumed to be 1.5 × 10− 4. Therefore, the error probability of secondary mutation on de novo sites was calculated as: pd = (PT)2 = (1.5 × 10− 4)2 = 2.25×10− 8. The probability of samples with such secondary mutations (i.e., number of secondary mutation ≥ 1) was estimated as 1-(1-p)n (p is the secondary mutation rate, n is the number of bases of SARS-CoV-2). Let pd = 2.25×10− 8, n = 29,410, then, the probability of samples with such secondary mutations was estimated to be 6.6×10− 4. Thus, we predict about 0.066% of samples with such secondary mutations on de novo mutation sites would be reconstructed into short chains. Because each node sequence has 2.4 copies (= 19187/7918) in our study, the probability of breaking the original chains was calculated as: (6.6×10− 4)2.4 = 2.3×10− 8, which is very small.
BANAL-52 has about 3.2% difference with MRCA of SARS-CoV-2 in the whole genome (PR), which is likely composed of ancestor mutation sites (cyan C, G) and incorrect mutation site (cyan A). Secondary mutation on these sites would cause model errors in reconstructing the transmission chains or networks of SARS-CoV-2. The mutation rate during three months of study period is assumed as: PT = 1.5 × 10− 4 substitutions per site within three months. Thus, the error probability caused by secondary mutations on ancestor and incorrect mutation sites was calculated as: pa = PR × PT = 3.2 × 10− 2 ×1.5 × 10− 4 = 4.8 × 10− 6. The probability of a sample that has secondary mutations on ancestor or BANAL-52 mutation sites was estimated as 1-(1-p)n (p is the total error probability caused by secondary mutations on mutation sites of ancestor of SARS-CoV-2 and BANAL-52, n is the number of bases of SARS-CoV-2 genome). Let p = 4.8×10− 6, n = 29,410, then, the probability of a sample that has secondary mutations on ancestor and incorrect mutation sites (cyan C, G, A in Fig. 4) was estimated to be 13.2%. Because each node sequence has 2.4 copies in our study, the probability of breaking the original chains was calculated as: (13.2%)2.4 = 0.78%. Thus, using BANAL-52 as reference would have little influence on the original transmission chains or network, although it may produce 13.2% short transmission chains.
Error probability caused by sequence gap or uncertainty
Similar to secondary mutations, degenerate or gaps (missing bases in the genome sequence) due to sequencing error or uncertainty may cause biased estimation of reconstruction of the evolutionary tree. These uncertain bases are often treated as none mutations in evolutionary studies (same in our study). In our study, there were 1.9 degenerate (or missing) bases per sequence on average. Thus, the proportion of degenerate or missing bases was calculated as: PD = 2.9/29410 = 9.8 × 10− 5. Therefore, the model error probability caused by degenerate or missing bases on de novo mutations was calculated as: ps = PD × PT = 9.8 × 10− 5 × 1.5 × 10− 4 = 1.47 × 10− 8. Similarly, the model error probability on ancestor and incorrect mutations was calculated as: ps = PD × PR = 9.8 × 10− 5 × 3.2 × 10− 2 = 3.1 × 10− 6. The probability of samples with degenerate or missing bases (i.e. number of degenerate or missing bases ≥ 1) was estimated as 1-(1-p)n (p is the degenerate mutation rate, n is the number of bases of SARS-CoV-2). Let p = 1.47 × 10− 8, or, 3.1 × 10− 6, n = 29,410, then p = 4.3×10− 4, or, 8.7%. Thus, we predict a total of 8.7% samples would be reconstructed into short transmission chains caused by degenerate or missing bases. Because each node sequence has 2.4 copies in our study, the probability of breaking the original chains by degenerate base was calculated as: (8.7%)2.4 = 0.29%. Therefore, degenerate mutations or missing bases would have little influence on the reconstruction of original transmission chains or network, but they would produce 7.13% short transmission chains.
Simulation analysis
To validate our method of reconstructing the transmission network of SARS-CoV-2 based on the paternity relationship as described above, we simulated the occurrence of mutations based on the way that the virus sequence replicates in nature. We chose a sequence with a length of 1000 bp as the starting sequence to simulate the proliferation process of this sequence. Each sequence produced three progeny sequences in one replication cycle (one generation). We set the mortality rate of sequences as 20%, and the mutation rate as 0.1% for each base within a replication cycle. We chose a shorter sequence and a higher mutation rate for saving computation time.
We simulated the evolution process of a sequence for ten generations and chose the first sequence in the 10th generation as a template to produce six generations as the detected samples (similar to the detected SARS-CoV-2 samples after late December of 2019) for subsequent analysis (Fig. S2). We chose the last sequence in the 9th generation as the reference sequence (similar to BANAL-52) (Fig. S2). We selected three-group data from the detected samples: (1) the detected samples from 0th to 6th generation (the 0th generation represents the starting sequence of the simulated data) which includes the common ancestor, (2) the detected samples from the 2nd to 6th generation with missing data (similar to the undetected data) from 0th to 2nd generation, and (3) the detected samples from the 4th to 6th generation with missing data from 0th to 3rd generations. By using these three-group data, we reconstructed the transmission chains and networks based on the paternity relationship, tested the three hypotheses (Fig. 1), and validated the method we used in this study (Table 2).
The simulation process was implemented by custom scripts in Python-3.7.