In this section, we describe the principal component analysis method and the K-means method used to identify communities of countries affected by COVID–19, based on the case data in the COVID–19 dataset. For implementation and visualization, we have used Python 3.7 and Tableau 2019.
3.1 Dimensionality Reduction using PCA
Principal component analysis (PCA) is an essential approach for analyzing patterns in data. After finding the patterns, it reduces the dimensions of the data without losing much of the information. PCA maximizes the variance of the projected data and minimizes the squared error between the data points and their projections. The COVID–19 data consists of 13 continuous variables, and some of these variables are highly correlated. Through an orthogonal transformation, PCA turns strongly correlated variables into uncorrelated variables. PCA has been helpful in creating a feature set that captures the relevant information from the 13 variables of the COVID–19 data. It thereby reduces the number of variables while maximizing the retained variance.
Given the COVID–19 data, we first standardize it, because the variables are measured on different scales and therefore each of them has a different variance. For this reason, it is important to normalize these variables in order to find reliable communities while retaining the relevant information. We then generate the covariance matrix for all the variables. The covariance of two variables X and Y can be computed using the formula

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \qquad (1)$$

where $X_i$ is the i-th data point of the variable X, n is the number of data points in the dataset, and $\bar{X}$ is the mean of X, given by $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. $Y_i$ and $\bar{Y}$ are defined similarly. Using the covariance values, we have constructed the covariance matrix. From the covariance matrix, we have calculated the eigenvalues and eigenvectors. The eigenvectors of the covariance matrix give the lines that characterize the data. The eigenvectors with the largest eigenvalues are the principal components of the dataset. We therefore sort the eigenvalues in descending order, which gives the order of significance of the principal components. Fig. 4 shows the cumulative variance of the principal components. The first eight principal components account for 99.9% of the cumulative variance, that is, they preserve a variance of almost 1. Hence, we have taken eight principal components and ignored the remaining components of lesser significance. Retaining almost 100% of the explained variance means that almost all of the information explained by the original 13 variables is retained. Additionally, these eight components provided the most accurate communities, as explained in the next section.
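As an illustration of how the number of retained components can be read off the sorted eigenvalues, the following minimal Python sketch computes the cumulative explained-variance ratio and returns the smallest number of components that reaches a chosen threshold. The function name `components_needed`, the threshold values, and the toy eigenvalues are assumptions made for illustration only; they are not the values obtained from the COVID–19 data.

```python
from itertools import accumulate


def components_needed(eigenvalues, threshold=0.999):
    """Return (k, cumulative), where k is the smallest number of leading
    components whose cumulative explained-variance ratio reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)  # descending significance
    total = sum(ordered)
    cumulative = [running / total for running in accumulate(ordered)]
    for k, ratio in enumerate(cumulative, start=1):
        if ratio >= threshold:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues, for illustration only (not the paper's values).
toy_eigenvalues = [0.5, 4.0, 2.5, 1.0, 2.0]
k, cumulative = components_needed(toy_eigenvalues, threshold=0.9)
print("components kept:", k)
print("cumulative ratios:", [round(c, 3) for c in cumulative])
```

Plotting `cumulative` against the component index would give a curve of the kind shown in a cumulative-variance figure.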
Algorithm 1: PCA Algorithm
Input: COVID-19 dataset, K (number of clusters)
1: Standardize the dataset.
2: Generate the covariance matrix from the covariance values computed using (1).
3: Compute the eigenvalues (the magnitudes of variance captured) and the eigenvectors (the principal components).
4: Sort the eigenpairs in descending order of eigenvalue and select the components with the largest eigenvalues that together capture a variance of almost 1.
Output: Reduced new dataset.
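The four steps of the PCA algorithm above can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `pca_reduce`, the default threshold, and the synthetic data are hypothetical, the sample covariance (division by n − 1, the default of `np.cov`) is assumed, and every variable is assumed to have non-zero variance.

```python
import numpy as np


def pca_reduce(data, variance_to_keep=0.999):
    """Return (scores, explained_ratio) for a rows-by-variables array."""
    X = np.asarray(data, dtype=float)
    # Step 1: standardize each variable (assumes no constant column).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized variables.
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort eigenpairs by descending eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    # Keep the fewest components whose cumulative ratio reaches the target.
    k = int(np.searchsorted(np.cumsum(explained_ratio), variance_to_keep)) + 1
    k = min(k, len(eigenvalues))
    # Output: project the standardized data onto the kept components.
    scores = Z @ eigenvectors[:, :k]
    return scores, explained_ratio


# Synthetic data for illustration only: three independent variables plus
# two exact linear combinations, so only three directions carry variance.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 50))
toy = np.column_stack([a, b, c, a + b, b - c])

scores, explained_ratio = pca_reduce(toy)
print("reduced shape:", scores.shape)
print("explained variance ratios:", np.round(explained_ratio, 3))
```

Because the projected columns are taken along eigenvectors of the covariance matrix, they are mutually uncorrelated, which is the property of PCA described in this section.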
3.2 Community Detection using centroid-based K-means approach
After obtaining the reduced dataset, we have applied the unsupervised clustering method K-means to it. The K-means method uncovers communities among heterogeneous elements by clustering them into homogeneous groups. It groups the elements into clusters that were undefined at the beginning of the analysis. This methodology has been used earlier in various sectors, such as clinical and public health research. The choice among the different methods of unsupervised clustering depends on the characteristics of the dataset. In this research work, we have used the centroid-based K-means algorithm. It is suitable for communities that have similar densities, similar sizes, and a globular shape.
The K-means method requires K (the number of clusters) as an input. Therefore, we have used the elbow method [8] to identify the optimal number of clusters, K. The elbow method measures the homogeneity or heterogeneity within communities as the number of clusters changes. Fig. 5 shows the elbow function plot for the communities, which indicates where the convergence to the centroids is maximized. The data from the countries support the selection of six and seven clusters. This selection is based on grouping together countries that share common socio-demographic and epidemiological profiles.
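A curve of the kind used by the elbow method can be produced by running K-means for several values of K and recording the within-cluster sum of squares (WCSS) for each. The following is a minimal NumPy sketch, not the code used in this work: the helper names `kmeans_wcss` and `elbow_curve`, the random initialization, the number of restarts, and the synthetic three-group data are all assumptions made for illustration.

```python
import numpy as np


def kmeans_wcss(X, k, n_iter=100, seed=0):
    """Run a basic K-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance of every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def elbow_curve(X, k_values, n_restarts=10):
    """Lowest WCSS found over a few random restarts, for each K."""
    X = np.asarray(X, dtype=float)
    return {k: min(kmeans_wcss(X, k, seed=s) for s in range(n_restarts))
            for k in k_values}


# Synthetic data for illustration only: three well-separated groups.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])

curve = elbow_curve(blobs, range(1, 7))
for k, wcss in curve.items():
    print(k, round(wcss, 1))
```

On such data the WCSS drops sharply up to the true number of groups and only slowly afterwards; the bend in the curve is the "elbow". When neighbouring values of K give similar WCSS, the curve alone does not single out one value, and additional criteria are needed.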
The numbers of clusters around the elbow deliver almost the same information because of the limited number of observations in this analysis. Visual analysis of maps and plots has therefore been used to decide on the number of communities that gives the best output, grouping countries with the same background into stable clusters.
The retrospective analysis recommends selecting six and seven clusters for the K-means method. During the visual inspection of the maps, geographical, epidemiological, and geopolitical knowledge has been used as input. Community detection has been done using six and seven clusters, and both have given productive results consistent with prior knowledge. Overall, our decision to go with six and seven clusters has been a good selection. Hence, we have detected the communities using the K-means method after finalizing the number of clusters. The steps of the K-means method are described in Algorithm 2.
Algorithm 2: K-means Method
Input: Normalized reduced new dataset, K (number of clusters)
1: Choose K (obtained from the elbow method) initial centroids.
2: Construct K communities by assigning each data point to its nearest centroid.
3: Update the centroids by calculating the central data point of each cluster.
4: Iterate the above two steps until no data point is reassigned to another community.
Output: Communities of countries.
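The K-means steps listed above can be sketched in Python with NumPy as follows. This is a minimal sketch, not the implementation used in this work: the function name `kmeans`, the choice of K distinct random data points as initial centroids, the Euclidean distance, the use of the cluster mean as the central point, and the toy points are assumptions made for illustration.

```python
import numpy as np


def kmeans(X, k, max_iter=300, seed=0):
    """Return (labels, centroids) for a rows-by-features array."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial centroids (assumed: K distinct random points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no point is reassigned to another community.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy points for illustration only: two clearly separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print("labels:", labels)
print("centroids:", np.round(centroids, 2))
```

The cluster indices themselves are arbitrary and depend on the initial centroids; only the grouping of the points is meaningful.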
Fig. 6 shows the communities obtained using the K-means approach with K = 6. It shows the US and Brazil in cluster 2, Seychelles in cluster 5, the UK and the Netherlands in cluster 6, and the rest of the countries in the other clusters. Fig. 7 shows the regional count of the countries in each community for K-means with K = 6. It shows that the countries of cluster 1 and cluster 3 overlap with all the regions with respect to COVID–19 cases. Cluster 2 belongs to the Americas region only (US and Brazil). Cluster 4 overlaps with four regions. Cluster 6 belongs to the European region only.
Fig. 8 shows the communities obtained using K-means with K = 7. It shows the US in cluster 6, Seychelles in cluster 5, Brazil in cluster 3, the UK and the Netherlands in cluster 2, and the rest of the countries in the other clusters. Fig. 9 shows the regional count of the countries in each community for K-means with K = 7. It shows that cluster 1 overlaps with four regions. Cluster 2 belongs to the European region only (UK and Netherlands). The countries of cluster 4 and cluster 7 overlap with all the regions. Cluster 5 belongs to the African region. Cluster 6 belongs to the American region.
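Regional counts of this kind can be obtained by tallying (cluster, region) pairs. The short Python sketch below does this for the five countries named in the text for K = 6 only. The region labels attached to each country and the function name `count_by_cluster_and_region` are assumptions made for illustration, and the remaining countries of the dataset are not included.

```python
from collections import Counter

# Cluster assignments for K = 6, restricted to the countries named in the text.
assignments_k6 = {"US": 2, "Brazil": 2, "Seychelles": 5, "UK": 6, "Netherlands": 6}

# Region labels are an assumption for illustration.
region_of = {
    "US": "Americas",
    "Brazil": "Americas",
    "Seychelles": "Africa",
    "UK": "Europe",
    "Netherlands": "Europe",
}


def count_by_cluster_and_region(assignments, regions):
    """Count countries for every (cluster, region) pair."""
    return Counter((cluster, regions[country])
                   for country, cluster in assignments.items())


counts = count_by_cluster_and_region(assignments_k6, region_of)
for (cluster, region), n in sorted(counts.items()):
    print(f"cluster {cluster}: {region} = {n}")
```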