Community Detection using Unsupervised Machine Learning Technique on COVID-19 Dataset

doi:10.21203/rs.3.rs-74143/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal publication: the published version appeared in Social Network Analysis and Mining on 09 Mar, 2021.

Abstract

COVID-19 has been considered one of the most destructive pandemics in human history. The worldwide research community has made a tenacious effort to study COVID-19 and analyse its impact on the economic, medical and sociological fields. Researchers are trying to solve many crucial issues related to this disease and to derive strategies for dealing with the global pandemic. In this paper, we analyse the trend of COVID-19, the countries affected in each region, and the variation of cases at the country level in a COVID-19 dataset. We use principal component analysis (PCA) on the dataset variables to reduce the dimensionality and find the most significant variables. Further, we unveil the hidden community structure of countries by applying an unsupervised clustering approach, K-means. The resulting communities can be useful to researchers, politicians, sociologists, policy makers and health-sector managers.

Artificial Intelligence and Machine Learning

COVID-19

coronavirus

K-means

PCA

communities

machine learning

1. Introduction

The coronavirus disease COVID-19 has been spreading rapidly across the world since the beginning of 2020. The World Health Organization (WHO) categorized it as a global pandemic [1] because of its highly contagious nature. In the current situation, all countries are struggling with COVID-19 and are still looking for cost-effective and practical solutions to counter the challenges that arise in various ways. Researchers from different fields, such as engineering and the physical sciences, are attempting to take on these challenges, develop new theories and generate user-centred solutions [2].

Recent studies on COVID-19 have mainly focused on analysis at the individual level, that is, on the attributes and symptoms of the disease. Studies covering wide geographical areas and large populations have been few so far. Hence, there is significant scope for research in this area beyond the work done on patient information [3].

In this paper, we analyse a country-level COVID-19 dataset, which can reveal potentially modifiable factors that individual-level datasets cannot uncover because of their limited variables. Furthermore, we use unsupervised machine learning for community detection to analyse the behaviour of countries during the COVID-19 global pandemic. Community detection helps unveil groups of countries and regions that COVID-19 has affected in a similar pattern. Regions and countries could use these patterns to prevent worst-case situations, and WHO and other global organizations could use the cluster information to give similar aid to similar countries. We have therefore developed an unsupervised machine learning method that combines PCA (principal component analysis) [4] and K-means clustering [5] on the country-level COVID-19 pandemic dataset to detect communities of countries with respect to the country-level variables. That is, we first find the significant variables of the COVID-19 pandemic by applying PCA, and then detect the communities of countries using these essential characteristics.

In this research paper, our contributions are as follows:

The analysis of the country-level COVID-19 dataset helps in understanding how many countries are affected in each WHO region.
The analysis of the top four most affected countries gives a better idea of how the different types of cases vary at the country level.
PCA finds the most significant patterns in the data, which helps reduce the number of dimensions without losing much information.
The community detection approach groups the countries, which helps to objectively distinguish countries and regions with respect to the spread and outcomes of COVID-19 in the dataset.
Policy makers, for instance managers and physicians from the health sector, sociologists and politicians, can make use of these analyses.

2. COVID-19 Dataset Description and Analysis

The COVID-19 dataset is collected from the official website of Johns Hopkins University [6]. It contains the number of cases from January 22nd, 2020 to August 15th, 2020. Excel 2019 was used to collect and integrate the dataset. The final dataset can be retrieved from [7]. The retrieved country-level data is recorded in Excel and analysed further. The country-level dataset covers 187 countries and has 15 variables.

Fig. 1 shows the number of countries affected by COVID-19 in each WHO region. The European region has the most affected countries (56), followed by the African region (48). Fig. 2 shows the percentage of total deaths due to COVID-19 in each WHO region. The Americas region is the most affected, with 54.16% of deaths, followed by the European region with 28.45%. The death percentage is low in the African region despite it being the second most affected region by number of countries (48). Furthermore, the South-East Asia region has ten affected countries, the lowest of all regions; however, it has a higher death percentage than the Western Pacific region, which has 16 affected countries.

Fig. 3 shows the percentage of cases in the top four most affected countries: the percentages of confirmed, death, recovered and active cases for Brazil, the US, India and Russia. The US is the country hardest hit by the COVID-19 pandemic. The death percentage is lowest in Russia compared with the other three countries.
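The regional aggregation behind Fig. 1 and Fig. 2 (number of countries per WHO region and each region's share of total deaths) can be sketched as follows. This is a minimal illustration only: the rows are hypothetical placeholders, not figures from the dataset, and the function name `regional_summary` is an assumption introduced here.

```python
from collections import defaultdict

# Hypothetical rows (country, WHO region, deaths); NOT the real figures.
rows = [
    ("Country A", "Americas", 600),
    ("Country B", "Americas", 200),
    ("Country C", "Europe", 150),
    ("Country D", "Africa", 50),
]


def regional_summary(rows):
    """Return {region: (number_of_countries, percent_of_total_deaths)}."""
    counts = defaultdict(int)
    deaths = defaultdict(int)
    for _country, region, n_deaths in rows:
        counts[region] += 1
        deaths[region] += n_deaths
    total = sum(deaths.values())
    return {r: (counts[r], 100.0 * deaths[r] / total) for r in counts}


summary = regional_summary(rows)
for region, (n_countries, pct) in sorted(summary.items()):
    print(f"{region}: {n_countries} countries, {pct:.2f}% of deaths")
```

Keeping the country count and the death share side by side makes it easy to spot regions, such as the African region in the text, where many countries are affected but the share of deaths is small.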

3. Unsupervised Machine Learning for Community Detection on COVID-19

In this section, we describe the principal component analysis method and the K-means method used to identify communities of countries affected by COVID-19, based on the cases in the dataset. For implementation and visualization, we used Python 3.7 and Tableau 2019.

3.1 Dimensionality Reduction using PCA

Principal component analysis (PCA) is an essential approach for analysing patterns in data. After finding the patterns, it reduces the dimensions of the data without losing much information. PCA maximizes the variance of the projected data and minimizes the squared error between data points and their projections. The COVID-19 data consists of 13 continuous variables, some of which are highly correlated. Through an orthogonal transformation, PCA turns strongly correlated variables into uncorrelated ones. PCA has been helpful in creating a feature set that captures the relevant information from the 13 variables of the COVID-19 data, reducing the number of variables while maximizing the retained variance.

Given the COVID-19 data, we first standardize it. The variables are measured on different scales and therefore have different variances, so they must be normalized before clustering in order to find reliable communities while retaining the relevant information. We then generate the covariance matrix for all the variables. The covariance of two variables is computed as

$$\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \tag{1}$$

where $X_i$ are the data points of the variable $X$, $n$ is the number of data points in the dataset, and $\bar{X}$ is the mean, given by

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

$Y_i$ and $\bar{Y}$ are defined similarly. Using the covariance values, we construct the covariance matrix, and from it we calculate the eigenvalues and eigenvectors. The eigenvectors of the covariance matrix give the directions that characterize the data. The eigenvectors with the largest eigenvalues are the principal components of the dataset, so we sort the eigenvalues in descending order, which gives the order of significance of the principal components. Fig. 4 shows the cumulative variance of the principal components. The first eight principal components account for a cumulative variance of 99.9%, that is, an explained-variance ratio of almost 1. Hence, we keep eight principal components and ignore the remaining components of lesser significance. Retaining close to 100% of the explained variance means that almost all of the information carried by the original 13 variables is preserved. Additionally, these eight components provided the most accurate communities, as explained in the next section.
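As a small worked illustration of the standardization and covariance steps, the sketch below computes the covariance of two variables and builds a covariance matrix from standardized columns. It assumes the sample covariance with an n - 1 denominator; the column values and the helper names (`covariance`, `standardize`, `covariance_matrix`) are hypothetical and not taken from the paper.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of x and y (n - 1 denominator, an assumption)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def standardize(x):
    """Rescale a column to zero mean and unit sample standard deviation."""
    x_bar = mean(x)
    sd = sqrt(covariance(x, x))
    return [(xi - x_bar) / sd for xi in x]


def covariance_matrix(columns):
    """Pairwise covariances of all columns, as a list of rows."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Hypothetical toy columns, not values from the COVID-19 dataset.
confirmed = [10.0, 20.0, 30.0, 40.0]
deaths = [1.0, 2.0, 3.0, 4.0]
recovered = [8.0, 3.0, 9.0, 2.0]

z = [standardize(c) for c in (confirmed, deaths, recovered)]
cov = covariance_matrix(z)
for row in cov:
    print([round(v, 3) for v in row])
```

After standardization every diagonal entry equals 1, so the covariance matrix coincides with the correlation matrix. The first two toy columns are perfectly correlated, which is the kind of redundancy PCA is meant to remove.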

Algorithm 1: PCA Algorithm
Input	:	COVID-19 dataset
1	:	Standardize the dataset.
2	:	Generate the covariance matrix from the covariance values computed using (1).
3	:	Compute the eigenvalues (the magnitudes of the variance captured) and the eigenvectors (the principal components).
4	:	Sort the eigenpairs in descending order of eigenvalue and select the leading components that together capture a cumulative variance of approximately 1.
Output	:	Reduced new dataset.
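The PCA steps listed above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `pca`, the `variance_to_keep` parameter and the random demo data are introduced here, and the sketch assumes no column is constant (a constant column would have zero standard deviation).

```python
import numpy as np


def pca(data, variance_to_keep=0.999):
    """PCA via eigen-decomposition of the covariance matrix.

    data: array-like of shape (n_samples, n_features).
    Returns (projected_data, cumulative_explained_variance_ratio).
    """
    data = np.asarray(data, dtype=float)
    # Step 1: standardize each column (zero mean, unit sample std).
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized variables.
    cov = np.cov(z, rowvar=False)
    # Step 3: eigh handles symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort eigenpairs in descending order of eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Keep the smallest number of components reaching the variance target.
    n_components = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    n_components = min(n_components, len(eigenvalues))
    return z @ eigenvectors[:, :n_components], cumulative


# Hypothetical demo: the third column is the sum of the first two,
# so two components carry essentially all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
data = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])
projected, cumulative = pca(data)
print(projected.shape, np.round(cumulative, 4))
```

The `cumulative` array plays the role of the cumulative variance curve in Fig. 4: the number of retained components is the first position at which the curve reaches the chosen threshold.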

3.2 Community Detection using centroid based K-means approach

After obtaining the reduced dataset, we applied the unsupervised clustering method K-means to it. K-means uncovers communities among heterogeneous elements by clustering them into homogeneous groups that are not defined at the beginning of the analysis. This methodology has been used earlier in various sectors, such as clinical and public health research. The choice of unsupervised clustering method depends on the characteristics of the dataset. In this work, we use the centroid-based K-means algorithm, which is suitable for communities that have similar densities, similar sizes and a globular shape.

The K-means method requires K (the number of clusters) as input. Therefore, we used the Elbow method [8] to identify the optimal number of clusters, K. The elbow method measures the homogeneity or heterogeneity within communities as the number of clusters changes. Fig. 5 shows the elbow plot, which indicates where the convergence of the data points to their centroids stops improving markedly. The country data support the selection of six or seven clusters. The choice between six and seven clusters is based on grouping together countries that share common socio-demographic and epidemiological profiles.

Cluster counts near the elbow yield almost the same information because the number of observations in this analysis is limited. We therefore inspected maps and plots visually to decide on the number of communities, aiming for the best output: stable clusters of countries with a similar background.

In retrospect, the analysis recommends selecting six and seven clusters for the K-means method. During the visual inspection of the maps, geographical, epidemiological and geopolitical knowledge was used as input to the decision. Community detection was carried out with both six and seven clusters, and both gave useful results that agree with prior knowledge. Overall, choosing six and seven clusters proved to be a good selection. Hence, we detected the communities using the K-means method after finalizing the number of clusters. The steps of the K-means method are described in Algorithm 2.

Algorithm 2: K-means Method
Input	:	Normalized reduced new dataset, K (number of clusters)
1	:	Choose K initial centroids, with K obtained from the Elbow method.
2	:	Construct K communities by assigning each data point to its nearest centroid.
3	:	Update each centroid to the mean of the data points in its cluster.
4	:	Iterate the above two steps until no data point is reassigned to another community.
Output	:	Communities of countries.
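The K-means steps above, together with the within-cluster sum of squares (WCSS) that an elbow plot displays, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the source does not say how the initial centroids are chosen, so a deterministic farthest-first initialization is assumed here to keep the example reproducible; the toy points and all function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centroids(points, k):
    """Assumed initialization: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def k_means(points, k, max_iter=100):
    """Return (labels, centroids, wcss) for the given points and K."""
    centroids = farthest_first_centroids(points, k)  # step 1
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once no point changes community.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Hypothetical toy data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labels, centroids, wcss = k_means(points, 3)
wcss_by_k = {k: k_means(points, k)[2] for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

Plotting `wcss_by_k` against K gives the elbow curve: the value drops sharply up to the natural number of groups and flattens afterwards. On real data the bend is often less distinct, which is consistent with the visual inspection described in the text.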

Fig. 6 shows the communities obtained using K-means with K = 6. The US and Brazil are in cluster 2, Seychelles is in cluster 5, the UK and the Netherlands are in cluster 6, and the rest of the countries are in the other clusters. Fig. 7 shows the regional count of countries in each community for K = 6. The countries of cluster 1 and cluster 3 span all the regions in terms of COVID-19 cases. Cluster 2 belongs to the Americas region only (US and Brazil). Cluster 4 spans four regions. Cluster 6 belongs to the European region only.

Fig. 8 shows the communities obtained using K-means with K = 7. The US is in cluster 6, Seychelles is in cluster 5, Brazil is in cluster 3, the UK and the Netherlands are in cluster 2, and the rest of the countries are in the other clusters. Fig. 9 shows the regional count of countries in each community for K = 7. Cluster 1 spans four regions. Cluster 2 belongs to the European region only (UK and Netherlands). The countries of cluster 4 and cluster 7 span all the regions. Cluster 5 belongs to the African region. Cluster 6 belongs to the Americas region.
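The regional breakdown of communities shown in Fig. 7 and Fig. 9 amounts to a cross-tabulation of cluster labels against WHO regions. The sketch below illustrates the idea using only the five countries whose K = 6 cluster numbers are named in the text; their WHO regions are filled in as an assumption, and the function names are hypothetical. Because the other countries are omitted, the result is not a reproduction of the figures.

```python
from collections import defaultdict

# K = 6 cluster numbers as stated in the text; regions are assumed.
assignments = {
    "US": ("Americas", 2),
    "Brazil": ("Americas", 2),
    "Seychelles": ("Africa", 5),
    "UK": ("Europe", 6),
    "Netherlands": ("Europe", 6),
}


def regions_per_cluster(assignments):
    """Map each cluster label to the set of regions it contains."""
    regions = defaultdict(set)
    for _country, (region, cluster) in assignments.items():
        regions[cluster].add(region)
    return dict(regions)


def single_region_clusters(assignments):
    """Clusters whose listed members all lie in one region."""
    by_cluster = regions_per_cluster(assignments)
    return sorted(c for c, regs in by_cluster.items() if len(regs) == 1)


print(regions_per_cluster(assignments))
print(single_region_clusters(assignments))
```

With the full list of 187 countries, the same grouping would show which clusters span several regions and which are confined to one.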

4. Conclusion

In this research paper, we analysed the trend of the countries affected in each region and the variation of cases at the country level in a COVID-19 dataset. We applied an unsupervised machine learning approach, principal component analysis, to the dataset variables to reduce the dimensionality while retaining close to 100% of the variance and to find the most significant variables. Further, we detected the hidden community structure of countries by applying another unsupervised approach, K-means. The resulting communities of countries can be beneficial in making policies in the health sector, and this information can also help physicians and economic experts. It could also help countries and regions that belong to the same community to receive similar aid and to take preventive measures that avoid worst-case scenarios.

Acknowledgement

This research was funded by NFOBC fellowship of University Grants Commission under the Ministry of Human Resource Development (Government of India).

Competing interests:

The authors declare no competing interests.

References

[1]. WHO. Briefing by WHO Director-General Tedros Adhanom Ghebreyesus. March 11, 2020. (accessed at: https://www.pscp.tv/w/1djxXQkqApVKZ).

[2]. Singh, Ravi Pratap, et al. "Internet of things (IoT) applications to fight against COVID-19 pandemic." Diabetes & Metabolic Syndrome: Clinical Research & Reviews (2020).

[3]. Carrillo-Larco, Rodrigo M., and Manuel Castillo-Cara. "Using country-level variables to classify countries according to the number of confirmed COVID-19 cases: An unsupervised machine learning approach." Wellcome Open Research 5.56 (2020): 56.

[4]. Shlens, Jonathon. "A tutorial on principal component analysis." arXiv preprint arXiv:1404.1100 (2014).

[5]. Figueiredo, Mario A. T., and Anil K. Jain. "Unsupervised learning of finite mixture models." IEEE Transactions on Pattern Analysis and Machine Intelligence 24.3 (2002): 381-396.

[6]. https://github.com/CSSEGISandData/COVID-19

[7]. https://github.com/imdevskp/covid_19_jhu_data_web_scrap_and_cleaning

[8]. Marutho, Dhendra, Sunarna Hendra Handaka, and Ekaprana Wijaya. "The determination of cluster number at k-mean using elbow method and purity evaluation on headline news." 2018 International Seminar on Application for Technology of Information and Communication. IEEE, 2018.
