Data
Our data describe infected individuals and each patient's route of infection, as made publicly accessible by the Seoul, Gyeonggi-do, and Incheon local governments in South Korea. Because these three municipalities comprise the Seoul metropolitan area, our data show the situation in the capital region of South Korea. Most of the COVID-19 infections in South Korea occurred in the Seoul metropolitan area and in Daegu and Gyeongsangbuk-do. However, because infections in Daegu and Gyeongsangbuk-do spread rapidly in the early period (February and March), when the government was not yet well prepared to gather data, data on the route of infection are lacking for those regions.
Although webpages containing publicly accessible information differ across local governments, the items commonly disclosed are the confirmation identification number, infection routes, date of confirmation for COVID-19 positivity, and the hospital where the infected person is being treated.
We paid particular attention to the route of infection, which is a record of key contacts in the infection process identified by the local government and health authorities. This record contains both people and events. If a patient number is recorded as the route of infection for a particular patient, the source of infection is specified exactly. However, if a mass infection occurs in a confined space, or a person is found to be a patient after returning from a foreign country, it is difficult to know who infected whom. In such cases, the name of the event or place is recorded instead of a patient number. In other words, the record of the infection route allows us to build infection networks, at least in a limited form.
The specific data collection process is as follows. We created a scraping program that automatically collects the relevant information from the three local government websites. Because the information is spread across many pages, it is impractical for human researchers to collect it by hand. After the data were collected using this program, we converted the raw data into structured network data. First, we extracted the link information and formed a network of infections between individuals. Individuals are nodes, and links are the infection relationships between them. If another patient is identified in the infection path of one patient, a connection between them is assumed. At the same time, two properties of every node were extracted and recorded: the confirmed date of each patient and the category of the infection path. Infection path categories describe whether an individual patient's path to infection falls under <Personal>, <Group>, <Overseas>, or <Unknown>. In many cases, events or groups are listed on the infection path information page of individual patients. For example, a local government's homepage may record that “Patient No. 2000 was infected via a mass infection in Itaewon.” In this case, no link can be identified because no interpersonal infection information exists, and this person's infection path is categorized as <Group>. <Personal> means that a specific patient infected the patient, <Overseas> means that the person was infected abroad, and <Unknown> means that the route of infection is unknown.
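The conversion from raw route records to network data can be illustrated with a minimal Python sketch. It is not the scraping or parsing program used in this study: the record layout, the route wording, the `patient #N` pattern, and the keyword rules in `classify` are all hypothetical assumptions chosen only to show how links and the four categories could be derived.

```python
import re
from collections import Counter

# Hypothetical records: (patient_id, confirmed_date, free-text infection route).
# Real local government pages use different layouts and wording.
records = [
    (1, "2020-05-06", "Itaewon club cluster"),
    (2, "2020-05-08", "Contact with patient #1"),
    (3, "2020-05-09", "Contact with patient #1"),
    (4, "2020-05-10", "Arrival from overseas"),
    (5, "2020-05-11", "Under investigation"),
    (6, "2020-05-12", "Contact with patient #3"),
]

# Assumed pattern for a reference to another patient in the route text.
PATIENT_REF = re.compile(r"patient #(\d+)", re.IGNORECASE)


def classify(route):
    """Return (category, source_id) for one route string."""
    match = PATIENT_REF.search(route)
    if match:
        return "Personal", int(match.group(1))
    lowered = route.lower()
    if "overseas" in lowered:
        return "Overseas", None
    if "investigation" in lowered or "unknown" in lowered:
        return "Unknown", None
    # Anything else names an event or place, so no interpersonal link exists.
    return "Group", None


def build_network(records):
    """Build node attributes and directed links (infector, infected)."""
    nodes, edges = {}, []
    for patient_id, confirmed, route in records:
        category, source = classify(route)
        nodes[patient_id] = {"confirmed": confirmed, "category": category}
        if source is not None:
            edges.append((source, patient_id))
    return nodes, edges


nodes, edges = build_network(records)
category_counts = Counter(attrs["category"] for attrs in nodes.values())
print(edges)
print(dict(category_counts))
```

Only records that name another patient produce a link; the remaining records contribute a node with a category but no edge, which is why the number of links can be much smaller than the number of nodes.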
The final infection network data cover patients in the Seoul metropolitan area from January 20 to July 19. The network consists of 3,283 nodes and 1,005 links. Links are directed because infection has a direction. The frequencies of the infection path categories across nodes are as follows: <Personal>: 972, <Group>: 869, <Overseas>: 748, <Unknown>: 694.
Method
We applied three main methods of analysis: network analysis, hypothesis tests on the distributions of network indicators, and virtual structural changes in the network.
First, network analysis refers to calculating various network indicators to obtain previously overlooked structural information. We paid particular attention to the out-degree of each node in the Korean COVID-19 infection network, the mean distance of the network, and the diameter of the network. The key to managing infectious diseases is to reduce people's contagion power. From the perspective of network science, this information is expressed through three indicators. The first is the out-degree of each node, which is the number of direct infections the node has produced. The second is the mean distance of a network, which is the average path length between all node pairs in the network. We can interpret the mean distance of an infection network as the average potential range of infection. The third is the diameter, which is the length of the longest geodesic in a network. In the context of infectious disease, the diameter shows the most extended “Nth transmission” in the network. See Figure 1 for the intuitive meaning of each indicator. We used these three indicators to measure the infectivity of the nodes and of the infection network. The mean distance and diameter, in particular, are indicators based on the whole network. By analyzing how these two indicators change over time and with policy implementation, we identified the corresponding changes in the structure of the whole network.
Figure 1. A small directed network
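As a worked illustration of the three indicators, the following Python sketch computes them for a small, hypothetical directed network (not the network in Figure 1). It assumes that the mean distance is averaged only over ordered node pairs connected by a directed path and that the diameter is the longest finite geodesic; the analysis in this study was carried out in R, so this is a sketch of the definitions rather than the code that was used.

```python
from collections import deque

# Hypothetical infection links (infector, infected).
edges = [(1, 2), (1, 3), (3, 4), (4, 5)]


def adjacency(edges):
    """Map every node to the list of nodes it directly infected."""
    adj = {}
    for source, target in edges:
        adj.setdefault(source, []).append(target)
        adj.setdefault(target, [])
    return adj


def bfs_distances(adj, start):
    """Shortest directed path lengths from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]  # drop the trivial zero-length path
    return dist


def indicators(edges):
    """Return (out-degree per node, mean distance, diameter)."""
    adj = adjacency(edges)
    out_degree = {node: len(targets) for node, targets in adj.items()}
    lengths = [d for node in adj for d in bfs_distances(adj, node).values()]
    mean_distance = sum(lengths) / len(lengths) if lengths else 0.0
    diameter = max(lengths, default=0)
    return out_degree, mean_distance, diameter


out_degree, mean_distance, diameter = indicators(edges)
print(out_degree, round(mean_distance, 3), diameter)
```

In this toy network, patient 1 has an out-degree of 2, and the chain 1 → 3 → 4 → 5 is the longest geodesic, so the diameter of 3 corresponds to a third-order transmission.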
Second, we conducted several hypothesis tests on the out-degree distribution to determine its features. If the out-degree measures the infectious power of nodes, the features of the out-degree distribution are also important. That is, health policies should be determined based on the characteristics of the distribution. As Meyer et al. pointed out, the infection status of a society could differ depending on the out-degree distribution [2]. Beyond infection networks, network science has long pointed out that the degree distributions of many networks have special features. The discussion and debate on scale-free networks and the power law initiated by Barabási is representative [6,7]. If the out-degree follows a power law, measures of central tendency such as the mean out-degree are not informative, and many health policies based on the average trend need to be reconsidered.
We tested whether the out-degree in the COVID-19 network of South Korea follows a power law. To this end, we followed the procedure proposed by Clauset et al. [8]. First, we estimated the parameters of the power-law distribution, assuming that the out-degrees of nodes were drawn from a power-law distribution. Then, using the bootstrap, we generated 3,000 sets of data from the estimated distribution and calculated the distance between each set and the distribution. These 3,000 distance values represent the random fluctuations that the data would show if they followed the power-law distribution. Next, we compared these distances with the distance between our actual data and the estimated power-law distribution, counting how often the distance for simulated data exceeded the distance for our real data. Using this result, we assessed whether the null hypothesis that the out-degrees of nodes follow a power law could be rejected. The Kolmogorov–Smirnov statistic was used to calculate the distance. Finally, the explanatory power of the power-law distribution was compared with that of other distributions that could serve as alternative models for fitting a heavy-tailed distribution.
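The bootstrap test described above can be sketched in Python as follows. This is a simplified illustration, not the poweRlaw-based analysis used in this study. It makes several assumptions: the lower cutoff is fixed at 1 rather than estimated, the support is truncated at `X_MAX`, the exponent is chosen by maximum likelihood from a coarse grid, nodes with an out-degree of 0 are excluded, and each simulated set is refitted before its distance is computed. The sample out-degrees are hypothetical.

```python
import bisect
import math
import random
from collections import Counter

X_MIN, X_MAX = 1, 1000  # assumed support; truncation is a simplification
ALPHAS = [1.5 + 0.05 * i for i in range(41)]  # candidate exponents 1.50 to 3.50


def model_cdf(alpha):
    """Normalizing constant and CDF of a discrete power law on [X_MIN, X_MAX]."""
    weights = [x ** -alpha for x in range(X_MIN, X_MAX + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for weight in weights:
        running += weight
        cdf.append(running / total)
    return total, cdf


MODELS = {alpha: model_cdf(alpha) for alpha in ALPHAS}


def fit_alpha(data):
    """Grid-search maximum-likelihood estimate of the exponent."""
    n = len(data)
    log_sum = sum(math.log(x) for x in data)
    return max(ALPHAS, key=lambda a: -a * log_sum - n * math.log(MODELS[a][0]))


def ks_distance(data, alpha):
    """Kolmogorov-Smirnov distance between the data and the fitted model."""
    cdf = MODELS[alpha][1]
    counts = Counter(data)
    n, seen, worst = len(data), 0, 0.0
    for x in range(X_MIN, max(data) + 1):
        seen += counts.get(x, 0)
        worst = max(worst, abs(seen / n - cdf[x - X_MIN]))
    return worst


def sample(alpha, n, rng):
    """Draw n values from the model by inverting its CDF."""
    cdf = MODELS[alpha][1]
    last = len(cdf) - 1
    return [X_MIN + min(bisect.bisect_left(cdf, rng.random()), last)
            for _ in range(n)]


def bootstrap_p_value(data, n_sims=3000, seed=0):
    """Share of simulated distances at least as large as the observed one."""
    rng = random.Random(seed)
    alpha_hat = fit_alpha(data)
    observed = ks_distance(data, alpha_hat)
    exceed = 0
    for _ in range(n_sims):
        synthetic = sample(alpha_hat, len(data), rng)
        if ks_distance(synthetic, fit_alpha(synthetic)) >= observed:
            exceed += 1
    return alpha_hat, observed, exceed / n_sims


# Hypothetical positive out-degrees; fewer simulations keep the example quick.
out_degrees = [1] * 60 + [2] * 14 + [3] * 6 + [4] * 3 + [5] * 2 + [7, 9, 12, 21]
alpha_hat, observed, p_value = bootstrap_p_value(out_degrees, n_sims=200)
print(alpha_hat, observed, p_value)
```

A small p-value means that data simulated from the fitted power law rarely lie as far from the model as the real data do, which is evidence against the power-law hypothesis. A large p-value means that the hypothesis cannot be rejected.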
Third, virtual structural changes in the network were used to estimate the expected effects of network-based health policies. If the health authorities had perfect network information, they would control the most infectious nodes in the overall infection network first. If successful, this would have the effect of isolating and eliminating those nodes from the measured infection network. We therefore removed the top 1% or 5% of nodes by out-degree and observed how the overall structure of the infection network and the related indicators changed. This gives us an idea of the expected effect of health policies that use network information.
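The node-removal experiment can be sketched in Python as follows. The toy network is hypothetical, and two details are assumptions of this sketch rather than choices documented in this study: the number of removed nodes is the rounded share of all nodes (with at least one node removed), and ties in out-degree are broken by node identifier.

```python
from collections import deque


def out_degree(nodes, edges):
    """Number of direct infections produced by each node."""
    degree = {node: 0 for node in nodes}
    for source, _ in edges:
        degree[source] += 1
    return degree


def path_lengths(nodes, edges):
    """All finite shortest directed path lengths between distinct nodes."""
    adj = {node: [] for node in nodes}
    for source, target in edges:
        adj[source].append(target)
    lengths = []
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        lengths.extend(d for d in dist.values() if d > 0)
    return lengths


def summarize(nodes, edges):
    """Network indicators before or after a virtual intervention."""
    lengths = path_lengths(nodes, edges)
    return {
        "max_out_degree": max(out_degree(nodes, edges).values(), default=0),
        "mean_distance": sum(lengths) / len(lengths) if lengths else 0.0,
        "diameter": max(lengths, default=0),
    }


def remove_top(nodes, edges, fraction):
    """Drop the top share of nodes by out-degree, together with their links."""
    degree = out_degree(nodes, edges)
    k = max(1, round(fraction * len(nodes)))  # assumed rounding rule
    removed = set(sorted(nodes, key=lambda n: (-degree[n], n))[:k])
    kept_nodes = [n for n in nodes if n not in removed]
    kept_edges = [(s, t) for s, t in edges
                  if s not in removed and t not in removed]
    return kept_nodes, kept_edges, removed


# Hypothetical network: node 0 infects nine people, followed by a short chain.
nodes = list(range(20))
edges = [(0, i) for i in range(1, 10)] + [(9, 10), (10, 11), (11, 12)]

before = summarize(nodes, edges)
kept_nodes, kept_edges, removed = remove_top(nodes, edges, 0.05)
after = summarize(kept_nodes, kept_edges)
print(removed, before, after)
```

Removing the single most infectious node in this toy example reduces the maximum out-degree from 9 to 1 and the diameter from 4 to 3, which is the kind of before-and-after comparison the virtual intervention is meant to provide.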
All the analyses described above were performed using R [9] and its packages, including the following: “tidyverse” [10] (for data wrangling and visualization), “igraph” [11] (for network analysis), “lubridate” [12] (for handling dates), “ggraph” [13] (for visualization), “slam” [14] (for data wrangling), and “poweRlaw” [15] (for analyzing the power law).
Seoul National University Institutional Review Board approved this study (IRB No. E2009/003-001, Results of review: Exemption).