In this analysis, we use Covid-19 data from Karnataka, India. The dataset was downloaded from Kaggle [8]. It lists Covid-19 cases for each state, but most attribute values are missing. We focus on three attributes: gender, age and current status (recovered, hospitalized or deceased). Because of the many missing values, analysing the dataset directly would give inaccurate results and carry a high risk of bias.
Therefore, we first perform data preprocessing. In this step, we check the missing values for each state and look for a time interval in which very few values are missing. Based on these two criteria, we select Karnataka. After selecting the target data, we analysed it thoroughly and finalized our research questions.
Research Questions
- Is there any relationship between gender and patient status?
- Is there any relationship between patient age and patient status?
- Is there any relationship between patient age and patient gender?
For this analysis, we used IBM SPSS[9] software.
Dataset
- The dataset has been taken from Kaggle.[8]
- There are a total of 17 attributes in the dataset.
- Except for age, all attribute data types are strings.
- Most analyses in SPSS cannot be applied directly to string variables.
- Therefore, we recode the values of gender, transmission type and current status as numeric codes on a nominal scale.
Table 1. Recoding of string values into nominal codes

| Attribute | Label | Value |
|---|---|---|
| Gender | Male | 1 |
| | Female | 2 |
| Current Status | Recovered | 1 |
| | Hospitalized | 2 |
| | Deceased | 3 |
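The recoding in Table 1 can be sketched outside SPSS as a simple lookup. This is a minimal illustration, not the authors' SPSS procedure: the codes come from Table 1, while the record layout, the key names `gender` and `status`, and the sample records are assumptions made for the example.

```python
# Codes taken from Table 1; record layout and key names are assumptions.
GENDER_CODES = {"Male": 1, "Female": 2}
STATUS_CODES = {"Recovered": 1, "Hospitalized": 2, "Deceased": 3}


def recode(value, codes):
    """Return the numeric code for a string label, or None if missing/unknown."""
    if value is None:
        return None
    # Normalise whitespace and capitalisation before the lookup.
    return codes.get(value.strip().title())


# Hypothetical records, not rows from the Kaggle dataset.
records = [
    {"gender": "Male", "status": "Hospitalized"},
    {"gender": "female", "status": "Recovered"},
    {"gender": "", "status": "Deceased"},  # empty string = missing gender
]

coded = [
    {
        "gender": recode(r["gender"], GENDER_CODES),
        "status": recode(r["status"], STATUS_CODES),
    }
    for r in records
]
print(coded)
```

Unknown or empty labels map to `None`, so missing values stay visible instead of being silently assigned a code.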
Age is available as an integer, but its values range from 0 to 100, which makes the raw data difficult to visualize. We therefore divided this attribute into categories and created a new age-group attribute.
Table 2. Recoding of age values into age groups and nominal codes

| Age Range | Age Group | Value |
|---|---|---|
| 0 to 18 | < 18 | 1 |
| 19 to 40 | 19 - 40 | 2 |
| 41 to 65 | 41 - 65 | 3 |
| Greater than 65 | > 65 | 4 |
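The binning in Table 2 amounts to a few threshold checks. The sketch below assumes integer ages and follows the "Age Range" column, so an age of exactly 18 falls into group 1. The function name `age_group` is hypothetical, and treating negative ages as missing is an assumption of this example.

```python
def age_group(age):
    """Map an integer age to the group code of Table 2.

    Missing ages (None) and negative ages are returned as None; treating
    negative ages as missing is an assumption of this sketch.
    """
    if age is None or age < 0:
        return None
    if age <= 18:
        return 1  # 0 to 18
    if age <= 40:
        return 2  # 19 to 40
    if age <= 65:
        return 3  # 41 to 65
    return 4      # greater than 65


# Boundary ages on each side of every cut-off.
for a in (18, 19, 40, 41, 65, 66):
    print(a, age_group(a))
```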
After filtering the data by state, we checked it for missing values. Karnataka had a total of 875 cases, of which 174 had missing age and gender values (Table 3). We removed these cases from the analysis. To do so, we first looked for the date range with the fewest missing values. After visualizing the data, we found that very few values are missing between 09 March 2020 and 27 April 2020. Table 4 shows that after filtering the data to this date range, only 2 missing values remain.
Table 3. Total cases in Karnataka

| | | Frequency | Percent |
|---|---|---|---|
| Valid | Male | 464 | 53.0 |
| | Female | 237 | 27.1 |
| | Total | 701 | 80.1 |
| Missing | | 174 | 19.9 |
| Total | | 875 | 100.0 |
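The Percent column of Table 3 is each frequency divided by the total number of cases, including the missing ones. The sketch below recomputes it from the frequencies reported in Table 3. The function name `frequency_table` and the row labels are assumptions of this example, not SPSS output.

```python
def frequency_table(counts, missing):
    """Build (frequency, percent) rows; percent is relative to all cases."""
    valid = sum(counts.values())
    total = valid + missing
    rows = {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}
    rows["Valid total"] = (valid, round(100 * valid / total, 1))
    rows["Missing"] = (missing, round(100 * missing / total, 1))
    rows["Total"] = (total, 100.0)
    return rows


# Frequencies as reported in Table 3.
table3 = frequency_table({"Male": 464, "Female": 237}, missing=174)
for label, (freq, pct) in table3.items():
    print(f"{label:12s} {freq:4d} {pct:6.1f}")
```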
Table 4. Total cases remaining after filtering the data by date

| | | Frequency | Percent |
|---|---|---|---|
| Valid | Male | 362 | 70.7 |
| | Female | 148 | 28.9 |
| | Total | 510 | 99.6 |
| Missing | | 2 | 0.4 |
| Total | | 512 | 100.0 |
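The date-range filter behind Table 4 can be sketched as follows. The window boundaries are the dates given in the text. The field name `diagnosed`, the helper names and the sample records are hypothetical and only illustrate the idea of counting missing values inside a window.

```python
from datetime import date

# Window reported in the text: 09 March 2020 to 27 April 2020 (inclusive).
START, END = date(2020, 3, 9), date(2020, 4, 27)


def in_window(record, start=START, end=END):
    """True if the record's date lies inside the inclusive window."""
    return start <= record["diagnosed"] <= end


def count_missing(records, field):
    """Number of records in which the given field is missing (None)."""
    return sum(1 for r in records if r.get(field) is None)


# Hypothetical records, not rows from the Kaggle dataset.
records = [
    {"diagnosed": date(2020, 3, 8), "gender": None},   # before the window
    {"diagnosed": date(2020, 3, 9), "gender": 1},      # first day, kept
    {"diagnosed": date(2020, 4, 27), "gender": None},  # last day, kept
    {"diagnosed": date(2020, 5, 1), "gender": None},   # after the window
]

window = [r for r in records if in_window(r)]
print(len(window), count_missing(window, "gender"))
```

Repeating the count for several candidate windows is one way to find the range with the fewest missing values.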
SPSS provides several ways to handle missing values. To exclude cases with a missing gender value, we use the following selection condition:
(gender = 1 or gender = 2)
This condition selects only the rows whose gender value is 1 or 2; all other rows remain unselected.
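For readers working outside SPSS, the same selection can be expressed as a list filter. This is only an analogue of the condition above, assuming each case is a dictionary with a `gender` key holding the codes from Table 1. The function name and sample records are hypothetical.

```python
def select_valid_gender(records):
    """Keep only records whose gender code is 1 or 2.

    Records with any other value, or with no gender at all, are left out
    of the result; the input list itself is not modified.
    """
    return [r for r in records if r.get("gender") in (1, 2)]


# Hypothetical records, not rows from the Kaggle dataset.
records = [
    {"id": "a", "gender": 1},
    {"id": "b", "gender": None},  # missing gender
    {"id": "c", "gender": 2},
    {"id": "d"},                  # gender field absent
]

selected = select_valid_gender(records)
print([r["id"] for r in selected])
```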
Table 5. Final dataset used for analysis

| | | Frequency | Percent |
|---|---|---|---|
| Valid | Male | 362 | 71.0 |
| | Female | 148 | 29.0 |
| | Total | 510 | 100.0 |
Table 5 shows the statistics after removing all missing values. Fig. 1 shows the pie chart of male and female cases.
Table 6 summarizes the current status attribute, which has no missing values. Fig. 2 shows the bar chart for current status; it can be clearly seen that the majority of cases are hospitalized.
Table 6. Valid and missing values in Current Status

| Current Status | N |
|---|---|
| Valid | 510 |
| Missing | 0 |
Table 7 provides details of the age value in the dataset. Fig. 3 shows the histogram, mean age and standard deviation for age. Fig. 4 and Fig. 5 show histograms for males and females, respectively.
Table 7. Valid and missing values in Age Bracket

| Age | N |
|---|---|
| Valid | 510 |
| Missing | 0 |
Cases according to age group, current status and gender are represented in graphical form in Fig. 6.
To answer the research questions, we perform a chi-square test of independence. This test is used when the data are nominal or ordinal and we want to determine whether two variables are related.
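As a rough illustration of what the test computes, the sketch below evaluates the Pearson chi-square statistic and its degrees of freedom for a table of counts. It is not the SPSS procedure used in the study, and the 2 x 3 table is hypothetical rather than the study's counts. The critical values are the standard 5% cut-offs for 2, 3 and 6 degrees of freedom, which are the values that arise for gender by status, age group by gender, and age group by status. The code assumes every row and column total is positive.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    `observed` is a list of rows of counts. All row and column totals are
    assumed to be positive, otherwise an expected count would be zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Standard chi-square critical values at the 5% significance level.
CRITICAL_05 = {2: 5.991, 3: 7.815, 6: 12.592}

# Hypothetical 2 x 3 table (rows: gender, columns: status); NOT study data.
toy = [[10, 20, 30], [20, 20, 20]]
stat, dof = chi_square(toy)
reject = stat > CRITICAL_05[dof]
print(round(stat, 3), dof, reject)
```

A statistic above the critical value for the table's degrees of freedom suggests that the two variables are not independent at the 5% level.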